
• Data Science General 1 – 10

o What is Data Science? 1


o Types of Data 1
o Main Types of Problems 1
o Probability Overview 1
o Descriptive Statistics 1
o Data Cleaning 2
o Feature Engineering 2
o Statistical Analysis 2
o Classic Statistical Distributions 3
o Modeling 3–4
§ Overview 3
§ Philosophies 3
§ Taxonomy 4
§ Evaluation Metrics 4
§ Evaluation Environment 4
o Linear Regression 5
o Linear Regression II 5
o Logistic Regression 5
o Distance / Network Methods 6
o Nearest Neighbor Classification 6
o Clustering 6
o Machine Learning I 7
o Machine Learning II 7
o Machine Learning III 7
o Machine Learning IV 8
o Deep Learning I 8
o Deep Learning II 8
o Deep Learning III 9
o Big Data 9
§ Hadoop Overview 9
o SQL I 10
o Python – Data Structures 10

• Business Science 11 – 14
o Problem Framework 11
o DS Python Workflow 12
o DS R Workflow 13
• Jupyter Notebook 15
• Python for Data Science 16 – 63
o Dataquest 16 – 20
§ Python Basics 16
§ Python Intermediate 17
§ NumPy 18
§ Pandas 19
§ Regular Expressions 20
o Datacamp 21 – 34
§ Importing Data 21
§ Python Basics 22
§ Numpy Basics 23
§ Pandas Basics 24
§ Pandas 25
§ Matplotlib 26
§ Seaborn 27
§ Bokeh 28
§ SciPy – Linear Algebra 29
§ PySpark – SQL Basics 30
§ PySpark – RDD Basics 31
§ Scikit – Learn 32
§ Keras 33
§ spaCy 34 – 35
o Python III 35 – 36
o Python Crash Course 38 – 63
§ Intro 38 – 39
§ Lists 40 – 41
§ Dictionaries 42 – 43
§ If Statements & While Loops 44 – 45
§ Functions 46 – 47
§ Classes 48 – 49
§ Files & Exceptions 50 – 51
§ Testing Code 52 – 53
§ Pygame 54 – 55
§ Matplotlib 56 – 57
§ Pygal 58 – 59
§ Django 60 – 63
• Command Line 63 – 74
o Linux Bash 63 – 70
o DVC 71
o GIT 72 – 74
• SQL 75 – 80
• Math 81 – 106
o Probabilities & Statistics 81 – 83
o Probability 84 – 93
§ Table of Distributions 93
o Linear Algebra & Calculus 94 – 95
o Calculus 96 – 106
• Machine Learning 107 – 121
o Choosing a Model 107
o Supervised Learning 108 – 113
§ Regression 108 – 109, 110 – 112
§ Classification 110 – 113
o Unsupervised Learning 114 – 118
§ Clustering 114 – 115
§ Unsupervised Learning 116 – 118
o Tips & Tricks 119 – 121
• Deep Learning 122 – 139
o Keras 122
o Neural Network Cells 123
o Neural Network Graphs 124
o Deep Learning 125 – 126
o Recurrent Neural Networks 127 – 131
o Convolutional Neural Networks 132 – 136
o Tips & Tricks 137 – 139
• Big Data 140 – 143
o Dask 140 – 141
o PySpark 142 – 143
• Natural Language Processing 144 – 185
o NLP 144 – 146
o Text Analysis with NLTK 147 – 149
o NLP Cheat Sheet 150 – 151
o Guide to NLP 152 – 185
Data Science Cheatsheet
Compiled by Maverick Lin (http://mavericklin.com)
Last Updated August 13, 2018

What is Data Science?
Multi-disciplinary field that brings together concepts from computer science, statistics/machine learning, and data analysis to understand and extract insights from the ever-increasing amounts of data.

Two paradigms of data research:
1. Hypothesis-Driven: Given a problem, what kind of data do we need to help solve it?
2. Data-Driven: Given some data, what interesting problems can be solved with it?

The heart of data science is to always ask questions. Always be curious about the world.
1. What can we learn from this data?
2. What actions can we take once we find whatever it is we are looking for?

Types of Data
Structured: Data that has predefined structures. e.g. tables, spreadsheets, or relational databases.
Unstructured: Data with no predefined structure; comes in any size or form and cannot be easily stored in tables. e.g. blobs of text, images, audio
Quantitative: Numerical data. e.g. height, weight
Categorical: Data that can be labeled or divided into groups. e.g. race, sex, hair color
Big Data: Massive datasets, or data that contains greater variety, arriving in increasing volumes and with ever-higher velocity (the 3 Vs). Cannot fit in the memory of a single machine.

Data Sources/Formats
Most Common Data Formats: CSV, XML, SQL, JSON, Protocol Buffers
Data Sources: Companies/Proprietary Data, APIs, Government, Academic, Web Scraping/Crawling

Main Types of Problems
Two problems arise repeatedly in data science.
Classification: Assigning something to a discrete set of possibilities. e.g. spam or non-spam, Democrat or Republican, blood type (A, B, AB, O)
Regression: Predicting a numerical value. e.g. someone's income, next year's GDP, a stock price

Probability Overview
Probability theory provides a framework for reasoning about the likelihood of events.

Terminology
Experiment: procedure that yields one of a possible set of outcomes e.g. repeatedly tossing a die or coin
Sample Space S: set of possible outcomes of an experiment e.g. if tossing a die, S = {1, 2, 3, 4, 5, 6}
Event E: set of outcomes of an experiment e.g. the event that a roll is 5, or the event that the sum of 2 rolls is 7
Probability of an Outcome s, or P(s): number that satisfies 2 properties:
1. for each outcome s, 0 ≤ P(s) ≤ 1
2. Σ p(s) = 1
Probability of Event E: sum of the probabilities of the outcomes of the experiment: P(E) = Σ_{s ∈ E} p(s)
Random Variable V: numerical function on the outcomes of a probability space
Expected Value of Random Variable V: E(V) = Σ_{s ∈ S} p(s) · V(s)

Independence, Conditional, Compound
Independent Events: A and B are independent iff:
P(A ∩ B) = P(A)P(B)
P(A|B) = P(A)
P(B|A) = P(B)
Conditional Probability: P(A|B) = P(A,B)/P(B)
Bayes Theorem: P(A|B) = P(B|A)P(A)/P(B)
Joint Probability: P(A,B) = P(B|A)P(A)
Marginal Probability: P(A)

Probability Distributions
Probability Density Function (PDF): gives the probability that a rv takes on the value x: p_X(x) = P(X = x)
Cumulative Density Function (CDF): gives the probability that a random variable is less than or equal to x: F_X(x) = P(X ≤ x)
Note: The PDF and the CDF of a given random variable contain exactly the same information.

Descriptive Statistics
Provides a way of capturing a given data set or sample. There are two main types: centrality and variability measures.

Centrality
Arithmetic Mean: useful to characterize symmetric distributions without outliers: µ_X = (1/n) Σ x_i
Geometric Mean: useful for averaging ratios; always less than or equal to the arithmetic mean: (a_1 a_2 ... a_n)^(1/n)
Median: exact middle value among a dataset. Useful for skewed distributions or data with outliers.
Mode: most frequent element in a dataset.

Variability
Standard Deviation: measures the squared differences between the individual elements and the mean:
σ = √( Σ_{i=1}^{N} (x_i − x̄)² / (N − 1) )
Variance: V = σ²

Interpreting Variance
Variance is an inherent part of the universe. It is impossible to obtain the same results after repeated observations of the same event due to random noise/error. Sometimes variance can be explained away by attributing it to sampling or measurement errors; other times, the variance is due to the random fluctuations of the universe.

Correlation Analysis
The correlation coefficient r(X,Y) is a statistic that measures the degree to which Y is a function of X and vice versa. Correlation values range from -1 to 1, where 1 means fully correlated, -1 means negatively correlated, and 0 means no correlation.
Pearson Coefficient: measures the degree of the relationship between linearly related variables: r = Cov(X,Y) / (σ(X) σ(Y))
Spearman Rank Coefficient: computed on ranks; depicts monotonic relationships
Note: Correlation does not imply causation!
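A minimal sketch of the centrality, variability, and correlation measures above, assuming NumPy and SciPy are available; the sample values are made up for illustration.

import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 30.0])   # note the outlier at 30
y = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 5.0, 6.0])

print(np.mean(x))            # arithmetic mean (pulled up by the outlier)
print(stats.gmean(x))        # geometric mean
print(np.median(x))          # median (robust to the outlier)
print(stats.mode(x))         # mode
print(np.std(x, ddof=1))     # sample standard deviation (N-1 denominator)
print(np.var(x, ddof=1))     # sample variance

print(stats.pearsonr(x, y))  # Pearson correlation (linear relationship)
print(stats.spearmanr(x, y)) # Spearman rank correlation (monotonic relationship)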
1
Data Cleaning
Data cleaning is the process of turning raw data into a clean and analyzable data set. "Garbage in, garbage out." Make sure garbage doesn't get put in.

Errors vs. Artifacts
1. Errors: information that is lost during acquisition and can never be recovered e.g. power outage, crashed servers
2. Artifacts: systematic problems that arise from the data cleaning process. These problems can be corrected, but we must first discover them.

Data Compatibility
Data compatibility problems arise when merging datasets. Make sure you are comparing "apples to apples" and not "apples to oranges". Main types of conversions/unifications:
• units (metric vs. imperial)
• numbers (decimals vs. integers)
• names (John Smith vs. Smith, John)
• time/dates (UNIX vs. UTC vs. GMT)
• currency (currency type, inflation-adjusted, dividends)

Data Imputation
Process of dealing with missing values. The proper methods depend on the type of data we are working with. General methods include:
• Drop all records containing missing data
• Heuristic-Based: make a reasonable guess based on knowledge of the underlying domain
• Mean Value: fill in missing data with the mean
• Random Value
• Nearest Neighbor: fill in missing data using similar data points
• Interpolation: use a method like linear regression to predict the value of the missing data

Outlier Detection
Outliers can interfere with analysis and often arise from mistakes during data collection. It makes sense to run a "sanity check".

Note: When cleaning data, always maintain both the raw data and the cleaned version(s). The raw data should be kept intact and preserved for future use. Any type of data cleaning/analysis should be done on a copy of the raw data.

Feature Engineering
Feature engineering is the process of using domain knowledge to create features or input variables that help machine learning algorithms perform better. Done correctly, it can help increase the predictive power of your models. Feature engineering is more of an art than a science. FE is one of the most important steps in creating a good model. As Andrew Ng puts it:

"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."

Continuous Data
Raw Measures: data that hasn't been transformed yet
Rounding: sometimes precision is noise; round to the nearest integer, decimal, etc.
Scaling: log, z-score, minmax scale
Imputation: fill in missing values using mean, median, model output, etc.
Binning: transforming numeric features into categorical ones (bins) e.g. values between 1-10 belong to A, between 10-20 belong to B, etc.
Interactions: interactions between features e.g. subtraction, addition, multiplication, statistical tests
Statistical: log/power transform (helps turn skewed distributions more normal), Box-Cox
Row Statistics: number of NaNs, 0s, negative values, max, min, etc.
Dimensionality Reduction: using PCA, clustering, factor analysis, etc.

Discrete Data
Encoding: since some ML algorithms cannot work on categorical data, we need to turn categorical data into numerical data or vectors
Ordinal Values: convert each distinct feature value into an integer (e.g. [r, g, b] becomes [1, 2, 3])
One-Hot Encoding: each of the m values becomes a vector of length m containing a single 1 (e.g. [r, g, b] becomes [[1,0,0], [0,1,0], [0,0,1]])
Feature Hashing Scheme: turns arbitrary features into indices in a vector or matrix
Embeddings: if using words, convert words to vectors (word embeddings)

Miscellaneous
Lowercasing, removing non-alphanumeric characters, repairing, unidecode, removing unknown characters
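An illustrative sketch of two of the encodings above using pandas; the DataFrame and its column names are hypothetical.

import pandas as pd

df = pd.DataFrame({"color": ["r", "g", "b", "g"],
                   "score": [3, 12, 18, 7]})

# Binning: turn the numeric 'score' column into categorical bins
df["score_bin"] = pd.cut(df["score"], bins=[0, 10, 20], labels=["A", "B"])

# One-hot encoding: each distinct color becomes its own 0/1 column
one_hot = pd.get_dummies(df["color"], prefix="color")

print(pd.concat([df, one_hot], axis=1))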
Statistical Analysis
Process of statistical reasoning: there is an underlying population of possible things we can potentially observe, and only a small subset of them are actually sampled (ideally at random). Probability theory describes what properties our sample should have given the properties of the population, but statistical inference allows us to deduce what the full population is like after analyzing the sample.

Sampling From Distributions
Inverse Transform Sampling: sampling points from a given probability distribution is sometimes necessary to run simulations or to check whether your data fits a particular distribution. The general technique is called inverse transform sampling, or the Smirnov transform. First draw a random number p between [0,1]. Compute the value x such that the CDF equals p: F_X(x) = p. Use x as the random value drawn from the distribution described by F_X(x).

Monte Carlo Sampling: in higher dimensions, correctly sampling from a given distribution becomes more tricky. Generally we want to use Monte Carlo methods, which typically follow these rules: define a domain of possible inputs, generate random inputs from a probability distribution over the domain, perform a deterministic calculation, and analyze the results.
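A minimal sketch of inverse transform sampling for the exponential distribution, whose CDF F(x) = 1 − e^(−λx) inverts to x = −ln(1 − p)/λ; the rate parameter is an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                  # rate parameter (illustrative)

p = rng.uniform(0.0, 1.0, size=100_000)    # step 1: draw p uniformly from [0, 1]
x = -np.log(1.0 - p) / lam                 # step 2: solve F_X(x) = p for x

print(x.mean())   # should be close to the theoretical mean 1/lambda = 0.5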
2
Classic Statistical Distributions

Binomial Distribution (Discrete)
Assume X is distributed Bin(n, p). X is the number of "successes" that we will achieve in n independent trials, where each trial is either a success or a failure, each success occurs with the same probability p, and each failure occurs with probability q = 1 − p.
PDF: P(X = x) = (n choose x) p^x (1 − p)^(n − x)
EV: µ = np    Variance: npq

Normal/Gaussian Distribution (Continuous)
Assume X is distributed N(µ, σ²). It is a bell-shaped and symmetric distribution. The bulk of the values lie close to the mean and no value is too extreme. Generalization of the binomial distribution as n → ∞.
PDF: P(x) = (1/√(2πσ²)) e^(−(x − µ)²/(2σ²))
EV: µ    Variance: σ²
Implications: 68%-95%-99.7% rule. 68% of the probability mass falls within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.

Poisson Distribution (Discrete)
Assume X is distributed Pois(λ). Poisson expresses the probability of a given number of events occurring in a fixed interval of time/space if these events occur independently and with a known constant rate λ.
PDF: P(x) = λ^x e^(−λ) / x!    EV: λ    Variance: λ

Power Law Distributions (Discrete)
Many data distributions have much longer tails than the normal or Poisson distributions. In other words, the change in one quantity varies as a power of another quantity. It helps measure the inequality in the world. e.g. wealth, word frequency, and the Pareto Principle (80/20 Rule)
PDF: P(X = x) = c x^(−α), where α is the law's exponent and c is the normalizing constant
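A quick sketch of the three named distributions using scipy.stats; the parameter values are arbitrary, chosen only for illustration.

from scipy import stats

binom = stats.binom(n=10, p=0.3)        # Binomial(n=10, p=0.3)
norm = stats.norm(loc=0.0, scale=1.0)   # Normal(mu=0, sigma=1)
pois = stats.poisson(mu=4.0)            # Poisson(lambda=4)

print(binom.pmf(3), binom.mean(), binom.var())   # P(X=3), np, npq
print(norm.pdf(0.0), norm.cdf(1.96))             # density at 0, P(X <= 1.96)
print(pois.pmf(2), pois.mean(), pois.var())      # P(X=2), lambda, lambda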
Modeling - Overview
Modeling is the process of incorporating information into a tool which can forecast and make predictions. Usually, we are dealing with statistical modeling where we want to analyze relationships between variables. Formally, we want to estimate a function f(X) such that:
Y = f(X) + ε
where X = (X1, X2, ..., Xp) represents the input variables, Y represents the output variable, and ε represents random error.

Statistical learning is a set of approaches for estimating this f(X).

Why Estimate f(X)?
Prediction: once we have a good estimate f̂(X), we can use it to make predictions on new data. We treat f̂ as a black box, since we only care about the accuracy of the predictions, not why or how it works.
Inference: we want to understand the relationship between X and Y. We can no longer treat f̂ as a black box since we want to understand how Y changes with respect to X = (X1, X2, ..., Xp).

More About ε
The error term ε is composed of the reducible and irreducible error, which will prevent us from ever obtaining a perfect f̂ estimate.
• Reducible: error that can potentially be reduced by using the most appropriate statistical learning technique to estimate f. The goal is to minimize the reducible error.
• Irreducible: error that cannot be reduced no matter how well we estimate f. Irreducible error is unknown and unmeasurable and will always be an upper bound for ε.

Note: There will always be trade-offs between model flexibility (prediction) and model interpretability (inference). This is just another case of the bias-variance trade-off. Typically, as flexibility increases, interpretability decreases. Much of statistical learning/modeling is finding a way to balance the two.

Modeling - Philosophies
Modeling is the process of incorporating information into a tool which can forecast and make predictions. Designing and validating models is important, as is evaluating the performance of models. Note that the best forecasting model may not be the most accurate one.

Philosophies of Modeling
Occam's Razor: philosophical principle that the simplest explanation is the best explanation. In modeling, if we are given two models that predict equally well, we should choose the simpler one. Choosing the more complex one can often result in overfitting.
Bias-Variance Trade-Off: inherent part of predictive modeling, where models with lower bias will have higher variance and vice versa. The goal is to achieve low bias and low variance.
• Bias: error from incorrect assumptions made to make the target function easier to learn (high bias → missing relevant relations, or underfitting)
• Variance: error from sensitivity to fluctuations in the dataset, or how much the target estimate would differ if different training data were used (high variance → modeling noise, or overfitting)
No Free Lunch Theorem: no single machine learning algorithm is better than all the others on all problems. It is common to try multiple models and find the one that works best for a particular problem.

Thinking Like Nate Silver
1. Think Probabilistically: probabilistic forecasts are more meaningful than concrete statements and should be reported as probability distributions (including σ along with the mean prediction µ).
2. Incorporate New Information: use live models, which continually update using new information. To update, use Bayesian reasoning to calculate how probabilities change in response to new evidence.
3. Look For Consensus Forecast: use multiple distinct sources of evidence. Some models operate this way, such as boosting and bagging, which use a large number of weak classifiers to produce a strong one.
3
Modeling - Taxonomy
There are many different types of models. It is important to understand the trade-offs and when to use a certain type of model.

Parametric vs. Nonparametric
• Parametric: models that first make an assumption about the functional form, or shape, of f (e.g. linear), then fit the model. This reduces estimating f to just estimating a set of parameters, but if our assumption was wrong, it will lead to bad results.
• Non-Parametric: models that don't make any assumptions about f, which allows them to fit a wider range of shapes, but may lead to overfitting.

Supervised vs. Unsupervised
• Supervised: models that fit input variables x_i = (x1, x2, ..., xn) to known output variables y_i = (y1, y2, ..., yn).
• Unsupervised: models that take in input variables x_i = (x1, x2, ..., xn) but do not have an associated output to supervise the training. The goal is to understand relationships between the variables or observations.

Blackbox vs. Descriptive
• Blackbox: models that make decisions, but we do not know what happens "under the hood" e.g. deep learning, neural networks
• Descriptive: models that provide insight into why they make their decisions e.g. linear regression, decision trees

First-Principle vs. Data-Driven
• First-Principle: models based on a prior belief of how the system under investigation works; incorporates domain knowledge (ad-hoc)
• Data-Driven: models based on observed correlations between input and output variables

Deterministic vs. Stochastic
• Deterministic: models that produce a single "prediction" e.g. yes or no, true or false
• Stochastic: models that produce probability distributions over possible events

Flat vs. Hierarchical
• Flat: models that solve problems on a single level, with no notion of subproblems
• Hierarchical: models that solve several different nested subproblems

Modeling - Evaluation Metrics
We need to determine how good our model is. The best way to assess models is with out-of-sample predictions (data points your model has never seen).

Classification
              Predicted Yes           Predicted No
Actual Yes    True Positives (TP)     False Negatives (FN)
Actual No     False Positives (FP)    True Negatives (TN)

Accuracy: ratio of correct predictions over total predictions. Misleading when class sizes are substantially different. accuracy = (TP + TN) / (TP + TN + FN + FP)
Precision: how often the classifier is correct when it predicts positive: precision = TP / (TP + FP)
Recall: how often the classifier is correct for all positive instances: recall = TP / (TP + FN)
F-Score: single measurement to describe performance: F = 2 · (precision · recall) / (precision + recall)
ROC Curves: plot true positive rates against false positive rates for various thresholds, where the threshold determines if a data point is classified as positive or negative (e.g. if > 0.8, classify as positive). The best possible area under the ROC curve (AUC) is 1, while random is 0.5 (the main diagonal line).

Regression
Errors are defined as the difference between a prediction y₀ and the actual result y.
Absolute Error: Δ = y₀ − y
Squared Error: Δ² = (y₀ − y)²
Mean Squared Error: MSE = (1/n) Σ_{i=1}^{n} (y₀ᵢ − yᵢ)²
Root Mean Squared Error: RMSE = √MSE
Absolute Error Distribution: plot the distribution of absolute errors; it should be symmetric, centered around 0, bell-shaped, and contain only rare extreme outliers.
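A hedged sketch of the evaluation metrics above with scikit-learn; the true/predicted values are toy numbers invented for illustration.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, mean_squared_error)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard class predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]    # predicted P(class = 1)

print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_prob))      # area under the ROC curve

# Regression: squared-error metrics on continuous predictions
mse = mean_squared_error([3.0, 5.0, 2.5], [2.8, 5.4, 2.0])
print(mse, mse ** 0.5)                    # MSE and RMSE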
Modeling - Evaluation Environment
Evaluation metrics provide us with the tools to estimate errors, but what should be the process to obtain the best estimate? Resampling involves repeatedly drawing samples from a training set and refitting a model to each sample, which provides us with additional information compared to fitting the model once, such as obtaining a better estimate for the test error.

Key Concepts
Training Data: data used to fit your models, or the set used for learning
Validation Data: data used to tune the parameters of a model
Test Data: data used to evaluate how good your model is. Ideally your model should never touch this data until final testing/evaluation.

Cross Validation
Class of methods that estimate the test error by holding out a subset of the training data from the fitting process.
Validation Set: split the data into a training set and a validation set. Train the model on the training set and estimate the test error using the validation set. e.g. 80-20 split
Leave-One-Out CV (LOOCV): split the data into a training set and a validation set, but the validation set consists of 1 observation. Then repeat n − 1 times until all observations have been used as validation. The test error is the average of these n test error estimates.
k-Fold CV: randomly divide the data into k groups (folds) of approximately equal size. The first fold is used as validation and the rest as training. Then repeat k times and find the average of the k estimates.

Bootstrapping
Methods that rely on random sampling with replacement. Bootstrapping helps with quantifying the uncertainty associated with a given estimate or model.

Amplifying Small Data Sets
What can we do if we don't have enough data?
• Create Negative Examples: e.g. when classifying presidential candidates, most people would be unqualified, so label most as unqualified
• Synthetic Data: create additional data by adding noise to the real data
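A small sketch of k-fold cross validation with scikit-learn on a built-in toy dataset; the model and number of folds are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: fit on 4 folds, score on the held-out fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())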
4
Linear Regression
Linear regression is a simple and useful tool for predicting a quantitative response. The relationship between input variables X = (X1, X2, ..., Xp) and output variable Y takes the form:
Y ≈ β0 + β1X1 + ... + βpXp + ε
β0 ... βp are the unknown coefficients (parameters) which we are trying to determine. The best coefficients will lead us to the best "fit", which can be found by minimizing the residual sum of squares (RSS), the sum of the squared differences between the actual ith value and the predicted ith value: RSS = Σ_{i=1}^{n} e_i², where e_i = y_i − ŷ_i.

How to find the best fit / the best coefficients?
Matrix Form: we can solve the closed-form equation for the coefficient vector w: w = (XᵀX)⁻¹XᵀY, where X represents the input data and Y represents the output data. This method is used for smaller matrices, since inverting a matrix is computationally expensive.
Gradient Descent: first-order optimization algorithm. We can find the minimum of a convex function by starting at an arbitrary point and repeatedly taking steps in the downward direction, which can be found by taking the negative direction of the gradient. After several iterations, we will eventually converge to the minimum. In our case, the minimum corresponds to the coefficients with the minimum error, or the best line of fit. The learning rate α determines the size of the steps we take in the downward direction.

Gradient descent algorithm in two dimensions. Repeat until convergence:
1. w0^(t+1) := w0^t − α ∂/∂w0 J(w0, w1)
2. w1^(t+1) := w1^t − α ∂/∂w1 J(w0, w1)

For non-convex functions, gradient descent no longer guarantees an optimal solution since there may be local minima. Instead, we should run the algorithm from different starting points and use the best local minimum we find for the solution.
Stochastic Gradient Descent: instead of taking a step after computing the gradient over the entire training set, we take a small batch of training data at random to determine our next step. Computationally more efficient and may lead to faster convergence.
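A toy sketch of the batch gradient descent update rule above for simple linear regression (y ≈ w0 + w1·x), minimizing the mean squared error J(w0, w1); the data, learning rate, and iteration count are illustrative.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=200)   # true intercept 3, slope 2, plus noise

w0, w1 = 0.0, 0.0
alpha = 0.01                                     # learning rate

for _ in range(5000):
    y_hat = w0 + w1 * x
    error = y_hat - y
    # partial derivatives of J = (1/n) * sum(error^2) with respect to w0 and w1
    grad_w0 = 2.0 * error.mean()
    grad_w1 = 2.0 * (error * x).mean()
    w0 -= alpha * grad_w0
    w1 -= alpha * grad_w1

print(w0, w1)   # should approach roughly (3, 2)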
Linear Regression II

Improving Linear Regression
Subset/Feature Selection: identify a subset of the p predictors that we believe to be best related to the response, then fit the model using the reduced set of variables.
• Best, Forward, and Backward Subset Selection
Shrinkage/Regularization: all variables are used, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. λ represents the tuning parameter: as λ increases, flexibility decreases → decreased variance but increased bias. The tuning parameter is key in determining the sweet spot between under- and over-fitting. In addition, while Ridge will always produce a model with p variables, Lasso can force coefficients to be equal to zero.
• Lasso (L1): min RSS + λ Σ_{j=1}^{p} |βj|
• Ridge (L2): min RSS + λ Σ_{j=1}^{p} βj²
Dimension Reduction: project the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations of the variables. Can use PCA.
Miscellaneous: removing outliers, feature scaling, removing multicollinearity (correlated variables)

Evaluating Model Accuracy
Residual Standard Error (RSE): RSE = √(RSS / (n − 2)). Generally, the smaller the better.
R²: measure of fit that represents the proportion of variance explained, or the variability in Y that can be explained using X. It takes on a value between 0 and 1. Generally the higher the better. R² = 1 − RSS/TSS, where the Total Sum of Squares (TSS) = Σ (y_i − ȳ)².

Evaluating Coefficient Estimates
The Standard Error (SE) of the coefficients can be used to perform hypothesis tests on the coefficients: H0: no relationship between X and Y; Ha: some relationship exists. A p-value can be obtained and interpreted as follows: a small p-value indicates that a relationship between the predictor (X) and the response (Y) exists. Typical p-value cutoffs are around 5% or 1%.

Logistic Regression
Logistic regression is used for classification, where the response variable is categorical rather than numerical.

The model works by predicting the probability that Y belongs to a particular category by first fitting the data to a linear regression model, which is then passed to the logistic function (below). The logistic function will always produce an S-shaped curve, so regardless of X, we can always obtain a sensible answer (between 0 and 1). If the probability is above a certain predetermined threshold (e.g. P(Yes) > 0.5), then the model will predict Yes.
p(X) = e^(β0 + β1X1 + ... + βpXp) / (1 + e^(β0 + β1X1 + ... + βpXp))

Maximum Likelihood: the coefficients β0 ... βp are unknown and must be estimated from the training data. We seek estimates for β0 ... βp such that the predicted probability p̂(x_i) of each observation is a number close to one if it is observed in a certain class and close to zero otherwise. This is done by maximizing the likelihood function:
l(β0, β1) = Π_{i: y_i = 1} p(x_i) · Π_{i': y_i' = 0} (1 − p(x_i'))

Potential Issues
Imbalanced Classes: an imbalance of classes in the training data leads to poor classifiers. It can result in a lot of false positives and leaves little training data for the minority class. Solutions include forcing balanced data by removing observations from the larger class, replicating data from the smaller class, or heavily weighing the training examples toward instances of the larger class.
Multi-Class Classification: the more classes you try to predict, the harder it will be for the classifier to be effective. It is possible with logistic regression, but another approach, such as Linear Discriminant Analysis (LDA), may prove better.
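A minimal sketch of logistic regression for binary classification with scikit-learn; the breast cancer toy dataset is chosen only for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)   # coefficients fit by maximizing the likelihood
clf.fit(X_train, y_train)

print(clf.predict_proba(X_test[:3]))      # p(X): probability of each class
print(clf.predict(X_test[:3]))            # class predicted with a 0.5 threshold
print(clf.score(X_test, y_test))          # accuracy on held-out data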
5
Distance/Network Methods
Interpreting examples as points in space provides a way to find natural groupings or clusters among data e.g. which stars are closest to our sun? Networks can also be built from point sets (vertices) by connecting related points.

Measuring Distances/Similarity
There are several ways of measuring distances between points a and b in d dimensions, with closer distances implying similarity.

Minkowski Distance Metric: d_k(a, b) = (Σ_{i=1}^{d} |a_i − b_i|^k)^(1/k)
The parameter k provides a way to trade off between the largest and the total dimensional difference. In other words, larger values of k place more emphasis on large differences between feature values than smaller values. Selecting the right k can significantly impact the meaningfulness of your distance function. The most popular values are 1 and 2.
• Manhattan (k=1): city block distance, or the sum of the absolute differences between two points
• Euclidean (k=2): straight line distance

Weighted Minkowski: d_k(a, b) = (Σ_{i=1}^{d} w_i |a_i − b_i|^k)^(1/k). In some scenarios, not all dimensions are equal; we can convey this idea using w_i. Generally not a good idea — you should normalize the data by z-scores before computing distances.

Cosine Similarity: cos(a, b) = (a · b) / (|a||b|), calculates the similarity between 2 non-zero vectors, where a · b is the dot product; higher values imply more similar vectors.

Kullback-Leibler Divergence: KL(A||B) = Σ_{i=1}^{d} a_i log2(a_i / b_i). KL divergence measures the distance between probability distributions by measuring the uncertainty gained or lost when replacing distribution A with distribution B. However, this is not a metric, but it forms the basis for the Jensen-Shannon Divergence Metric.
Jensen-Shannon: JS(A, B) = ½ KL(A||M) + ½ KL(B||M), where M is the average of A and B. The JS function is the right metric for calculating distances between probability distributions.

Nearest Neighbor Classification
Distance functions allow us to identify the points closest to a given target, or the nearest neighbors (NN) to a given point. The advantages of NN include simplicity, interpretability, and non-linearity.

k-Nearest Neighbors
Given a positive integer k and a point x0, the KNN classifier first identifies the k points in the training data most similar to x0, then estimates the conditional probability of x0 being in class j as the fraction of the k points whose values belong to j. The optimal value for k can be found using cross validation.

KNN Algorithm
1. Compute the distance D(a, b) from point b to all points
2. Select the k closest points and their labels
3. Output the class with the most frequent labels among the k points

Optimizing KNN
Comparing a query point a in d dimensions against n training examples has a runtime of O(nd), which can cause lag as points reach millions or billions. Popular choices to speed up KNN include:
• Voronoi Diagrams: partition the plane into regions based on distance to points in a specific subset of the plane
• Grid Indexes: carve up space into d-dimensional boxes or grids and calculate the NN in the same cell as the point
• Locality Sensitive Hashing (LSH): abandons the idea of finding the exact nearest neighbors. Instead, batch up nearby points to quickly find the most appropriate bucket B for our query point. LSH is defined by a hash function h(p) that takes a point/vector as input and produces a number/code as output, such that it is likely that h(a) = h(b) if a and b are close to each other, and h(a) != h(b) if they are far apart.
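A sketch of the distance measures above plus a scikit-learn KNN classifier; the vectors, dataset, and k are arbitrary illustrative choices.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

manhattan = np.sum(np.abs(a - b))                        # Minkowski, k=1
euclidean = np.sqrt(np.sum((a - b) ** 2))                # Minkowski, k=2
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b)) # cosine similarity
print(manhattan, euclidean, cosine)

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)                # k chosen arbitrarily; tune with CV
knn.fit(X, y)
print(knn.predict(X[:3]))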
Clustering
Clustering is the problem of grouping points by similarity using distance metrics, which ideally reflect the similarities you are looking for. Often items come from logical "sources", and clustering is a good way to reveal those origins. It is perhaps the first thing to do with any data set. Possible applications include: hypothesis development, modeling over smaller subsets of data, data reduction, outlier detection.

K-Means Clustering
Simple and elegant algorithm to partition a dataset into K distinct, non-overlapping clusters.
1. Choose a K. Randomly assign a number between 1 and K to each observation. These serve as the initial cluster assignments.
2. Iterate until the cluster assignments stop changing:
   (a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
   (b) Assign each observation to the cluster whose centroid is closest (where closest is defined using the distance metric).
Since the result of the algorithm depends on the initial random assignments, it is a good idea to repeat the algorithm from different random initializations to obtain the best overall result. Can use MSE to determine which cluster assignment is better.

Hierarchical Clustering
Alternative clustering algorithm that does not require us to commit to a particular K. Another advantage is that it results in a nice visualization called a dendrogram. Observations that fuse at the bottom are similar, while those at the top are quite different; we draw conclusions based on the location on the vertical rather than the horizontal axis.
1. Begin with n observations and a measure of all the n(n−1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
2. For i = n, n−1, ..., 2:
   (a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram where the fusion should be placed.
   (b) Compute the new pairwise inter-cluster dissimilarities among the remaining i − 1 clusters.
Linkage: Complete (max dissimilarity), Single (min), Average, Centroid (between the centroids of clusters A and B)
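A brief sketch of K-means and agglomerative (hierarchical) clustering with scikit-learn on synthetic blob data; K = 3 and the data parameters are illustrative.

from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)   # n_init reruns with new random initializations
labels_km = km.fit_predict(X)
print(km.cluster_centers_)
print(km.inertia_)        # within-cluster sum of squares, useful for comparing assignments

hc = AgglomerativeClustering(n_clusters=3, linkage="average")
labels_hc = hc.fit_predict(X)
print(labels_hc[:10])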
6
Machine Learning Part I

Comparing ML Algorithms
Power and Expressibility: ML methods differ in terms of complexity. Linear regression fits linear functions while NNs define piecewise-linear separation boundaries. More complex models can provide more accurate models, but at the risk of overfitting.
Interpretability: some models are more transparent and understandable than others (white box vs. black box models)
Ease of Use: some models feature few parameters/decisions (linear regression/NN), while others require more decision making to optimize (SVMs)
Training Speed: models differ in how fast they fit the necessary parameters
Prediction Speed: models differ in how fast they make predictions given a query

Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features.

Problem: suppose we need to classify vector X = x1...xn into m classes, C1...Cm. We need to compute the probability of each possible class given X, so we can assign X the label of the class with the highest probability. We can calculate a probability using Bayes' Theorem:
P(Ci|X) = P(X|Ci)P(Ci) / P(X)
Where:
1. P(Ci): the prior probability of belonging to class i
2. P(X): normalizing constant, or the probability of seeing the given input vector over all possible input vectors
3. P(X|Ci): the conditional probability of seeing input vector X given we know the class is Ci
The prediction model will formally look like:
C(X) = argmax_{i ∈ classes} P(X|Ci)P(Ci) / P(X)
where C(X) is the prediction returned for input X.

Machine Learning Part II

Decision Trees
Binary branching structure used to classify an arbitrary input vector X. Each node in the tree contains a simple feature comparison against some field (e.g. x_i > 42?). The result of each comparison is either true or false, which determines if we should proceed along to the left or right child of the given node. Also known as classification and regression trees (CART).
Advantages: non-linearity, support for categorical variables, easy to interpret, applicable to regression.
Disadvantages: prone to overfitting, unstable (not robust to noise), high variance, low bias.
Note: rarely do models just use one decision tree. Instead, we aggregate many decision trees using methods like ensembling, bagging, and boosting.

Ensembles, Bagging, Random Forests, Boosting
Ensemble learning is the strategy of combining many different classifiers/models into one predictive model. It revolves around the idea of voting: a so-called "wisdom of crowds" approach. The most predicted class will be the final prediction.
Bagging: ensemble method that works by taking B bootstrapped subsamples of the training data and constructing B trees, each tree trained on a distinct subsample; the B trees' predictions are then combined by voting or averaging.
Random Forests: builds on bagging by decorrelating the trees. We do everything the same as in bagging, but when we build the trees, every time we consider a split, a random sample of m of the p predictors is chosen as split candidates, not the full set (typically m ≈ √p). When m = p, we are just doing bagging.
Boosting: the main idea is to improve our model where it is not performing well by using information from previously constructed classifiers. A slow learner. Has 3 tuning parameters: the number of classifiers B, the learning parameter λ, and the interaction depth d (controls the interaction order of the model).

Machine Learning Part III

Support Vector Machines
Work by constructing a hyperplane that separates points between two classes. The hyperplane is determined using the maximal margin hyperplane, which is the hyperplane that is the maximum distance from the training observations. This distance is called the margin. Points that fall on one side of the hyperplane are classified as -1 and the others as +1.

Principal Component Analysis (PCA)
Principal components allow us to summarize a set of correlated variables with a smaller set of variables that collectively explain most of the variability in the original set. Essentially, we are "dropping" the least important feature variables.

Principal Component Analysis is the process by which principal components are calculated and then used to analyze and understand the data. PCA is an unsupervised approach and is used for dimensionality reduction, feature extraction, and data visualization. Variables after performing PCA are independent. Scaling variables is also important while performing PCA.
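A hedged sketch of a random forest and PCA with scikit-learn on the iris toy dataset; the number of trees and components are arbitrary illustrative choices.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=0)  # B = 100 bagged, decorrelated trees
rf.fit(X, y)
print(rf.feature_importances_)

X_scaled = StandardScaler().fit_transform(X)    # scaling matters before PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)            # variability captured by each component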


7
Machine Learning Part IV

ML Terminology and Concepts
Features: input data/variables used by the ML model
Feature Engineering: transforming input features to be more useful for the models e.g. mapping categories to buckets, normalizing between -1 and 1, removing nulls
Train/Eval/Test: training data is used to optimize the model, evaluation data is used to assess the model on new data during training, test data is used to provide the final result
Classification/Regression: regression is predicting a number (e.g. housing price), classification is predicting from a set of categories (e.g. predicting red/blue/green)
Linear Regression: predicts an output by multiplying and summing input features with weights and biases
Logistic Regression: similar to linear regression but predicts a probability
Overfitting: model performs great on the input data but poorly on the test data (combat with dropout, early stopping, or reducing the number of nodes or layers)
Bias/Variance: how much the output is determined by the features. More variance often means overfitting; more bias can mean a bad model
Regularization: variety of approaches to reduce overfitting, including adding the weights to the loss function and randomly dropping layers (dropout)
Ensemble Learning: training multiple models with different parameters to solve the same problem
A/B Testing: statistical way of comparing 2+ techniques to determine which technique performs better and also whether the difference is statistically significant
Baseline Model: simple model/heuristic used as a reference point for comparing how well a model is performing
Bias: prejudice or favoritism towards some things, people, or groups over others that can affect the collection/sampling and interpretation of data, the design of a system, and how users interact with a system
Dynamic Model: model that is trained online in a continuously updating fashion
Static Model: model that is trained offline
Normalization: process of converting an actual range of values into a standard range of values, typically -1 to +1
Independently and Identically Distributed (i.i.d.): data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on previously drawn values; ideal but rarely found in real life
Hyperparameters: the "knobs" that you tweak during successive runs of training a model
Generalization: refers to a model's ability to make correct predictions on new, previously unseen data as opposed to the data used to train the model
Cross-Entropy: quantifies the difference between two probability distributions

Deep Learning Part I

What is Deep Learning?
Deep learning is a subset of machine learning. One popular DL technique is based on Neural Networks (NN), which loosely mimic the human brain; the code structures are arranged in layers. Each layer's input is the previous layer's output, which yields progressively higher-level features and defines a hierarchy. A Deep Neural Network is just a NN that has more than 1 hidden layer.

Recall that statistical learning is all about approximating f(X). Neural networks are known as universal approximators, meaning that no matter how complex a function is, there exists a NN that can (approximately) do the job. We can increase the approximation (or complexity) by adding more hidden layers and neurons.

Popular Architectures
There are different kinds of NNs that are suitable for certain problems, depending on the NN's architecture.
Linear Classifier: takes input features and combines them with weights and biases to predict an output value
DNN: deep neural net, contains intermediate layers of nodes that represent "hidden features" and activation functions to represent non-linearity
CNN: convolutional NN, has a combination of convolutional, pooling, and dense layers; popular for image classification
Transfer Learning: use existing trained models as starting points and add additional layers for the specific use case. The idea is that highly trained existing models know general features that serve as a good starting point for training a small network on specific examples
RNN: recurrent NN, designed for handling a sequence of inputs that have "memory" of the sequence. LSTMs are a fancy version of RNNs, popular for NLP
GAN: generative adversarial network, one model creates fake examples, and another model is served both fake and real examples and is asked to distinguish between them
Wide and Deep: combines linear classifiers with deep neural net classifiers; the "wide" linear parts represent memorizing specific examples and the "deep" parts represent understanding high-level features

Deep Learning Part II

Tensorflow
Tensorflow is an open source software library for numerical computation using data flow graphs. Everything in TF is a graph, where nodes represent operations on data and edges represent the data. Phase 1 of TF is building up a computation graph and phase 2 is executing it. It is also distributed, meaning it can run on either a cluster of machines or just a single machine.
TF is extremely popular/suitable for working with Neural Networks, since the way TF sets up the computational graph pretty much resembles a NN.

Tensors
In a graph, tensors are the edges and are multidimensional data arrays that flow through the graph. They are the central unit of data in TF and consist of a set of primitive values shaped into an array of any number of dimensions.
A tensor is characterized by its rank (number of dimensions in the tensor), shape (number of dimensions and size of each dimension), and data type (data type of each element in the tensor).

Placeholders and Variables
Variables: best way to represent shared, persistent state manipulated by your program. These are the parameters of the ML model and are altered/trained during the training process. Training variables.
Placeholders: way to specify inputs into a graph; they hold the place for a Tensor that will be fed at runtime. They are assigned once and do not change afterwards. Input nodes.
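A minimal sketch of the two-phase graph workflow described above, assuming the TensorFlow 1.x-style API (in TF 2.x these calls live under tf.compat.v1).

# Phase 1: build the graph (no computation happens yet) — assumes TF 1.x-style API
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 3])   # placeholder: input fed at runtime
W = tf.Variable(tf.random_normal([3, 1]))         # variables: trainable parameters
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, W) + b                           # node: an operation on tensors

# Phase 2: execute the graph inside a session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))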
8
Deep Learning Part III

Deep Learning Terminology and Concepts
Neuron: node in a NN, typically taking in multiple input values and generating one output value; calculates the output value by applying an activation function (nonlinear transformation) to a weighted sum of input values
Weights: edges in a NN; the goal of training is to determine the optimal weight for each feature; if a weight = 0, the corresponding feature does not contribute
Neural Network: composed of neurons (simple building blocks that actually "learn"); contains activation functions that make it possible to predict non-linear outputs
Activation Functions: mathematical functions that introduce non-linearity to a network e.g. ReLU, tanh
Sigmoid Function: function that maps very negative numbers to a number very close to 0, huge numbers close to 1, and 0 to 0.5. Useful for predicting probabilities
Gradient Descent/Backpropagation: fundamental loss optimizer algorithms, on which the other optimizers are usually based. Backpropagation is similar to gradient descent but for neural nets
Optimizer: operation that changes the weights and biases to reduce loss e.g. Adagrad or Adam
Weights / Biases: weights are values that the input features are multiplied by to predict an output value; biases are the value of the output given a weight of 0
Converge: an algorithm that converges will eventually reach an optimal answer, even if very slowly. An algorithm that doesn't converge may never reach an optimal answer.
Learning Rate: rate at which optimizers change weights and biases. A high learning rate generally trains faster but risks not converging, whereas a lower rate trains slower
Numerical Instability: issues with very large/small values due to the limits of floating point numbers in computers
Embeddings: mapping from discrete objects, such as words, to vectors of real numbers; useful because classifiers/neural networks work well on vectors of real numbers
Convolutional Layer: series of convolutional operations, each acting on a different slice of the input matrix
Dropout: method for regularization in training NNs; works by removing a random selection of some units in a network layer for a single gradient step
Early Stopping: method for regularization that involves ending model training early
Gradient Descent: technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data
Pooling: reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix. Pooling usually involves taking either the maximum or average value across the pooled area

Big Data - Hadoop Overview
Data can no longer fit in memory on one machine (monolithic), so a new way of computing was devised using a group of computers to process this "big data" (distributed). Such a group is called a cluster, which makes up server farms. All of these servers have to be coordinated in the following ways: partition data, coordinate computing tasks, handle fault tolerance/recovery, and allocate capacity to process.

Hadoop
Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is comprised of 3 main components:
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data by partitioning data across many machines
• YARN: framework for job scheduling and cluster resource management (task coordination)
• MapReduce: YARN-based system for parallel processing of large data sets on multiple machines

HDFS
A cluster is comprised of 1 master node, while the rest are worker/data nodes. The master node manages the overall file system by storing the directory structure and the metadata of the files. The data nodes physically store the data. Large files are broken up and distributed across multiple machines, and the pieces are also replicated across multiple machines to provide fault tolerance.

MapReduce
Parallel programming paradigm which allows for the processing of huge amounts of data by running processes on multiple machines. Defining a MapReduce job requires two stages: map and reduce.
• Map: operation to be performed in parallel on small portions of the dataset. The output is a key-value pair <K, V>
• Reduce: operation to combine the results of Map

YARN - Yet Another Resource Negotiator
Coordinates tasks running on the cluster and assigns new nodes in case of failure. Comprised of 2 subcomponents: the resource manager and the node manager. The resource manager runs on a single master node and schedules tasks across nodes. The node manager runs on all other nodes and manages tasks on the individual nodes.

Big Data - Hadoop Ecosystem
An entire ecosystem of tools has emerged around Hadoop, which are based on interacting with HDFS. Below are some popular ones:

Hive: data warehouse software built on top of Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL-like queries (HiveQL). Hive abstracts away the underlying MapReduce jobs and returns results in the form of tables.
Pig: high level scripting language (Pig Latin) that enables writing complex data transformations. It pulls unstructured/incomplete data from sources, cleans it, and places it in a database/data warehouse. Pig performs ETL into the data warehouse while Hive queries from the data warehouse to perform analysis (GCP: DataFlow).
Spark: framework for writing fast, distributed programs for data processing and analysis. Spark solves similar problems as Hadoop MapReduce but with a fast in-memory approach. It is a unified engine that supports SQL queries, streaming data, machine learning, and graph processing. Can operate separately from Hadoop but integrates well with Hadoop. Data is processed using Resilient Distributed Datasets (RDDs), which are immutable, lazily evaluated, and track lineage.
HBase: non-relational, NoSQL, column-oriented database management system that runs on top of HDFS. Well suited for sparse data sets (GCP: BigTable).
Flink/Kafka: stream processing frameworks. Batch processing is for bounded, finite datasets, with periodic updates and delayed processing. Stream processing is for unbounded datasets, with continuous updates and immediate processing. Stream data and stream processing must be decoupled via a message queue. Can group streaming data (windows) using tumbling (non-overlapping time), sliding (overlapping time), or session (session gap) windows.
Beam: programming model to define and execute data processing pipelines, including ETL, batch, and stream (continuous) processing. After building the pipeline, it is executed by one of Beam's distributed processing back-ends (Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow). Modeled as a Directed Acyclic Graph (DAG).
Oozie: workflow scheduler system to manage Hadoop jobs
Sqoop: transferring framework to transfer large amounts of data into HDFS from relational databases (MySQL)
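A pure-Python sketch of the two MapReduce stages (word count); a real Hadoop job would run the map and reduce functions in parallel across many machines.

from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit a <word, 1> key-value pair for every word in every document
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group the intermediate pairs by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: combine the values for each key
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}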
9
SQL Part I
Structured Query Language (SQL) is a declarative language used to access & manipulate data in databases. Usually the database is a Relational Database Management System (RDBMS), which stores data arranged in relational database tables. A table is arranged in columns and rows, where columns represent characteristics of the stored data and rows represent actual data entries.

Basic Queries
- filter columns: SELECT col1, col3... FROM table1
- filter the rows: WHERE col4 = 1 AND col5 = 2
- aggregate the data: GROUP BY...
- limit aggregated data: HAVING count(*) > 1
- order the results: ORDER BY col2

Useful Keywords for SELECT
DISTINCT - return unique results
BETWEEN a AND b - limit the range; the values can be numbers, text, or dates
LIKE - pattern search within the column text
IN (a, b, c) - check if the value is contained among those given

Data Modification
- update specific data with the WHERE clause:
UPDATE table1 SET col1 = 1 WHERE col2 = 2
- insert values manually:
INSERT INTO table1 (col1, col3) VALUES (val1, val3);
- or by using the results of a query:
INSERT INTO table1 (col1, col3) SELECT col, col2 FROM table2;

Joins
The JOIN clause is used to combine rows from two or more tables, based on a related column between them.
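An illustrative sketch tying the keywords above together, using Python's built-in sqlite3 module with a made-up 'orders' table.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                 [("alice", 10.0), ("alice", 25.0), ("bob", 5.0), ("carol", 40.0)])

# SELECT + WHERE + GROUP BY + HAVING + ORDER BY in one query
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 1
    GROUP BY customer
    HAVING COUNT(*) >= 1
    ORDER BY total DESC
""").fetchall()
print(rows)   # [('carol', 40.0), ('alice', 35.0), ('bob', 5.0)]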
Python - Data Structures
Data structures are a way of storing and manipulating data, and each data structure has its own strengths and weaknesses. Combined with algorithms, data structures allow us to efficiently solve problems. It is important to know the main types of data structures that you will need to efficiently solve problems.

Lists: or arrays, ordered sequences of objects, mutable
>>> l = [42, 3.14, "hello", "world"]
Tuples: like lists, but immutable
>>> t = (42, 3.14, "hello", "world")
Dictionaries: hash tables, key-value pairs, unsorted
>>> d = {"life": 42, "pi": 3.14}
Sets: mutable, unordered sequences of unique elements; frozensets are just immutable sets
>>> s = set([42, 3.14, "hello", "world"])

Collections Module
deque: double-ended queue, a generalization of stacks and queues; supports append, appendleft, pop, rotate, etc.
>>> from collections import deque
>>> s = deque([42, 3.14, "hello", "world"])
Counter: dict subclass, unordered collection where elements are stored as keys and counts are stored as values
>>> from collections import Counter
>>> c = Counter('apple')
>>> print(c)
Counter({'p': 2, 'a': 1, 'l': 1, 'e': 1})

heapq Module
Heap Queue: priority queue; heaps are binary trees for which every parent node has a value less than or equal to any of its children (a min-heap in Python's heapq); order is important; supports push, pop, pushpop, heapify, and replace functionality
>>> from heapq import heappush, heappop
>>> heap = []
>>> for n in [3, 1, 4, 1, 5, 9, 2, 6]:
...     heappush(heap, n)
>>> heap
[1, 1, 2, 3, 5, 9, 4, 6]
>>> heappop(heap)   # always removes the smallest element
1

Recommended Resources
• Data Science Design Manual (www.springer.com/us/book/9783319554433)
• Introduction to Statistical Learning (www-bcf.usc.edu/~gareth/ISL/)
• Probability Cheatsheet (www.wzchen.com/probability-cheatsheet/)
• Google's Machine Learning Crash Course (developers.google.com/machine-learning/crash-course/)
10
Business Science
Problem Framework (BSPF)

1. View Business as a Machine
   Step 1: Isolate the business unit.
   Step 2: Define objectives. Define the machine in terms of people and processes.
   Step 3: Collect outcomes in terms of feedback. Feedback identifies problems.
2. Understand the Drivers
   Step 1: Investigate if objectives are being met.
   Step 2: Synthesize outcomes.
   Step 3: Hypothesize drivers.
3. Measure the Drivers
   Step 1: Collect data.
   Step 2: Develop KPIs.
4. Uncover Problems & Opportunities
   Step 1: Evaluate performance vs. KPIs.
   Step 2: Highlight potential problem areas.
   Step 3: Review the process and consider what could be missed or needed to answer questions.
5. Encode Algorithms
   Step 1: Develop algorithms to predict and explain the problem.
   Step 2: Tie the financial value of individual decisions to optimize for profit.
   Step 3: Use recommendation algorithms to improve decisions.
6. Measure Results
   Step 1: Capture outcomes after the decision making system is implemented.
   Step 2: Synthesize results in terms of good and bad outcomes: what was done and what happened.
   Step 3: Visualize outcomes over time to determine progress.
7. Report Financial Impact
   Step 1: Measure actual results.
   Step 2: Tie to financial benefits.
   Step 3: Report the financial benefit of the algorithms to key stakeholders.

The BSPF stages align with the six CRISP-DM phases:
CRISP Phase 1: Business Understanding
CRISP Phase 2: Data Understanding
CRISP Phase 3: Data Preparation
CRISP Phase 4: Modeling
CRISP Phase 5: Evaluation
CRISP Phase 6: Deployment
11

Learn to apply the BSPF in Data Science For Business With R (DS4B 201-R) Version 1.0
Data Science with Python Workflow
If you want to learn Python and this workflow for business analysis, take the Python For Business Analysis (DS4B 101-P) course through Business Science University. Click the links for documentation. CS = Cheat Sheet.

Workflow: Import → Tidy → Transform → Visualize / Model → Communicate
• Import: Pandas I/O tools; web data with Beautiful Soup and Scrapy
• Tidy: Pandas data structures
• Transform: Pandas (CS) — group by, joins, reshape (pivot), text, time series, categorical, missing data
• Visualize: matplotlib, seaborn
• Model: Scikit-Learn, Featuretools
• Communicate: Jupyter Notebook, Django, Flask
• Environment/IDEs: JupyterLab, Jupyter & PyCharm

Important Resources
Anaconda Distribution: https://www.anaconda.com/download/
Python Documentation: https://docs.python.org/
Python Standard Library: https://docs.python.org/3/library

Business Science University
"Data Science Education for the Enterprise" — university.business-science.io
12

version: 0.0
Data Science with R
Workflow
The Data Science With R Workflow is available in the book: R
For Data Science. If you want to learn R and this
workflow for business analysis, take the R For Business
Analysis (DS4B 101-R) course through Business Science
University. Click the links for
documentation. CS = Cheat Sheet.

- Import: readr (CS), readxl / writexl, odbc / DBI, rvest
- Tidy: tibble (CS), tidyr (CS)
- Transform: dplyr (CS), stringr (CS), lubridate (CS), forcats, purrr (CS) (iteration), Base R (CS)
- Visualize: ggplot2 (CS)
- Model: recipes, rsample, parsnip, broom, yardstick, dials
- Communicate: RMarkdown (CS), Shiny (CS)
- Tooling: RStudio IDE (CS), fs (file system)

Important Resources
R For Data Science Book: http://r4ds.had.co.nz/
Rmarkdown Book: https://bookdown.org/yihui/rmarkdown/
Data Visualization Book: https://rkabacoff.github.io/datavis/
More Cheatsheets: https://www.rstudio.com/resources/cheatsheets/ 
tidyverse packages: https://www.tidyverse.org/
Connecting to databases: https://db.rstudio.com/ 
RMarkdown website: https://rmarkdown.rstudio.com/ 
Shiny web applications website: http://shiny.rstudio.com/
Jenny Bryan's purrr tutorial: https://jennybryan.org/
Business Science University
"Data Science Education for the Enterprise" university.business­science.io
13

version: 1.1
Data Science with R: Special Topics

Text Analysis & NLP
- Text Mining with R (Book): tidytext
- NLP: H2O word2vec (word embeddings), text2vec (fast vectorization, topic modeling), udpipe (UDPipe C++ lib in R)

Machine Learning
- Multi-Threaded/Scalable/Production ML: H2O (CS)
- Extreme Gradient Boosting: xgboost
- R + Spark: sparklyr (CS)
- Sparkling Water (Spark + H2O): rsparkling
- ML (Tidy): parsnip
- ML: caret (CS)

Time Series Analysis
- Time-aware tibbles: tibbletime & tsibble
- Convert between classes: timetk & tsbox
- Time Series Index Summary: timetk
- Generating Future Series: timetk

Forecasting
- ARIMA, ETS, etc.: forecast & fable
- Tidy, glance, augment for forecast models: sweep
- Converting forecast predictions to tibbles: sweep

Anomaly Detection
- Identify anomalies: anomalize

Network Analysis
- Network Data Transformations (Tidy): tidygraph
- Network Data Transformations: igraph

Network Viz
- Static: ggraph (graph plotting utilities for ggplot2)
- Interactive (JavaScript): networkD3 (D3 networks in R), plotly (plotly.js network graphs in R)

Deep Learning
- R Interface to TensorFlow: Keras (CS), TF Estimators, TensorFlow (Core)

Speed & Scale
- Fastest Single-Node Speed: data.table (CS)
- Distributed Cluster (Spark): sparklyr (CS)

Interoperability
- Python: reticulate
- Java: rJava
- C++: Rcpp

Financial Analysis
- Getting financial data: tidyquant & quantmod
- Quantitative Analysis: tidyquant & xts/TTR
- Portfolio Analysis: tidyquant & PerformanceAnalytics

Financial & Time Viz
- Static: tidyquant (financial ggplot2 geoms)
- Interactive: highcharter (highchart.js in R), dygraphs (xts plotting), plotly (plotly.js financial charts in R)

Geospatial Analysis
- Geocoding (getting lat/long, bboxes, & sf's): ggmap (Google API, requires key), osmdata (OpenStreet Overpass API), tmaptools (OpenStreet Nominatum API)
- Simple Features (sf objects): sf (CS) (tidy)
- Spatial Objects (sp objects): sp (non-tidy)

Geospatial Viz
- Static: ggmap (Google API, requires key), osmplotr (impressive maps via OSM), tmap (thematic maps), cartography (CS) (thematic maps)
- Interactive (JavaScript): leaflet (CS) (leaflet.js in R), plotly (plotly.js maps in R)

Miscellaneous Tools
- Interactive Plotting: htmlwidgets for R
- Building R Packages: R Packages Book
- Pkg Development Tools: devtools (CS)
- R Templates: usethis
- Build Web Doc's: pkgdown
- Advanced Concepts (Advanced R Book): rlang & Tidy Evaluation (CS)

Making Blogs & Books
- Make a Website/Blog: blogdown
- Write a Web Book: bookdown
- Posting Code (GitHub, Stack Overflow): reprex

Business Science University
"Data Science Education for the Enterprise" university.business­science.io
14
Working with Different Programming Languages Widgets
Python For Data Science Cheat Sheet Kernels provide computation and communication with front-end interfaces Notebook widgets provide the ability to visualize and control changes
Jupyter Notebook like the notebooks. There are three main kernels: in your data, often as a control like a slider, textbox, etc.
Learn More Python for Data Science Interactively at www.DataCamp.com
You can use them to build interactive GUIs for your notebooks or to
IRkernel IJulia
synchronize stateful and stateless information between Python and
Installing Jupyter Notebook will automatically install the IPython kernel. JavaScript.
Saving/Loading Notebooks: Create new notebook; Open an existing notebook; Make a copy of the current notebook; Rename notebook; Save current notebook and record checkpoint; Revert notebook to a previous checkpoint; Preview of the printed notebook; Download notebook as IPython notebook, Python, HTML, Markdown, reST, LaTeX, or PDF; Close notebook & stop running any scripts.

Kernels: Restart kernel; Restart kernel & run all cells; Interrupt kernel; Interrupt kernel & clear all output; Connect back to a remote notebook; Run other installed kernels.

Widgets: Download serialized state of all widget models in use; Save notebook with interactive widgets; Embed current widgets.

Command Mode: the notebook toolbar icons are numbered 1-15 in the legend below.

Writing Code And Text


Code and text are encapsulated by 3 basic cell types: markdown cells, code
cells, and raw NBConvert cells.
Edit Mode: 1. Save and checkpoint 9. Interrupt kernel
Edit Cells 2. Insert cell below 10. Restart kernel
3. Cut cell 11. Display characteristics
Cut currently selected cells Copy cells from 4. Copy cell(s) 12. Open command palette
to clipboard clipboard to current 5. Paste cell(s) below 13. Current kernel
cursor position 6. Move cell up 14. Kernel status
Paste cells from Executing Cells 15. Log out from notebook server
7. Move cell down
clipboard above Paste cells from 8. Run current cell
current cell Run selected cell(s) Run current cells down
clipboard below
and create a new one
Paste cells from current cell Asking For Help
below
clipboard on top Run current cells down
Delete current cells
of current cel and create a new one Walk through a UI tour
Split up a cell from above Run all cells
Revert “Delete Cells” List of built-in keyboard
current cursor Run all cells above the Run all cells below
invocation shortcuts
position current cell the current cell Edit the built-in
Merge current cell Merge current cell keyboard shortcuts
Change the cell type of toggle, toggle Notebook help topics
with the one above with the one below current cell scrolling and clear Description of
Move current cell up Move current cell toggle, toggle current outputs markdown available Information on
down scrolling and clear in notebook unofficial Jupyter
Adjust metadata
underlying the Find and replace all output Notebook extensions
Python help topics
current notebook in selected cells IPython help topics
View Cells
Remove cell Copy attachments of NumPy help topics
attachments current cell Toggle display of Jupyter SciPy help topics
Toggle display of toolbar Matplotlib help topics
Paste attachments of Insert image in logo and filename
SymPy help topics
current cell selected cells Toggle display of cell Pandas help topics
action icons:
Insert Cells - None About Jupyter Notebook
- Edit metadata
Toggle line numbers - Raw cell format
Add new cell above the Add new cell below the - Slideshow
current one in cells - Attachments
current one
15

- Tags DataCamp
Learn Python for Data Science Interactively
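One common way to get the sliders and textboxes described at the top of this sheet is the ipywidgets package (an assumption here; the sheet itself does not name a library). A minimal sketch to run inside a notebook cell:

import ipywidgets as widgets
from IPython.display import display

@widgets.interact(n=(0, 10))        # interact() builds a slider and re-runs the function on change
def show_square(n=3):
    print(n, "squared is", n ** 2)

slider = widgets.IntSlider(value=5, min=0, max=10, description="n")
display(slider)                     # a standalone widget whose value syncs between Python and JavaScript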
LEARN DATA SCIENCE ONLINE
16
Start Learning For Free - www.dataquest.io

Data Science Cheat Sheet


Python Basics

BASICS, PRINTING AND GETTING HELP


x = 3 - Assign 3 to the variable x help(x) - Show documentation for the str data type
print(x) - Print the value of x help(print) - Show documentation for the print() function
type(x) - Return the type of the variable x (in this case, int for integer)

READING FILES 3 ** 2 - Raise 3 to the power of 2 (or 3²) def calculate(addition_one,addition_two,


f = open("my_file.txt","r") 27 ** (1/3) - The 3rd root of 27 (or 3√27) exponent=1,factor=1):
file_as_string = f.read() x += 1 - Assign the value of x + 1 to x result = (value_one + value_two) ** exponent * factor
- Open the file my_file.txt and assign its x -= 1 - Assign the value of x - 1 to x return result
contents to s - Define a new function calculate with two
import csv L I STS required and two optional named arguments
f = open("my_dataset.csv","r") l = [100,21,88,3] - Assign a list containing the which calculates and returns a result.
csvreader = csv.reader(f) integers 100, 21, 88, and 3 to the variable l addition(3,5,factor=10) - Run the addition
csv_as_list = list(csvreader) l = list() - Create an empty list and assign the function with the values 3 and 5 and the named
- Open the CSV file my_dataset.csv and assign its result to l argument 10
data to the list of lists csv_as_list l[0] - Return the first value in the list l
l[-1] - Return the last value in the list l B O O L E A N C O M PA R I S O N S
ST R I N G S l[1:3] - Return a slice (list) containing the second x == 5 - Test whether x is equal to 5
s = "hello" - Assign the string "hello" to the and third values of l x != 5 - Test whether x is not equal to 5
variable s len(l) - Return the number of elements in l x > 5 - Test whether x is greater than 5
s = """She said, sum(l) - Return the sum of the values of l x < 5 - Test whether x is less than 5
"there's a good idea." min(l) - Return the minimum value from l x >= 5 - Test whether x is greater than or equal to 5
""" max(l) - Return the maximum value from l x <= 5 - Test whether x is less than or equal to 5
- Assign a multi-line string to the variable s. Also l.append(16) - Append the value 16 to the end of l x == 5 or name == "alfred" - Test whether x is
used to create strings that contain both " and ' l.sort() - Sort the items in l in ascending order equal to 5 or name is equal to "alfred"
characters " ".join(["A","B","C","D"]) - Converts the list x == 5 and name == "alfred" - Test whether x is
len(s) - Return the number of characters in s ["A", "B", "C", "D"] into the string "A B C D" equal to 5 and name is equal to "alfred"
s.startswith("hel") - Test whether s starts with 5 in l - Checks whether the value 5 exists in the list l
the substring "hel" DICTIONARIES "GB" in d - Checks whether the value "GB" exists in
s.endswith("lo") - Test whether s ends with the d = {"CA":"Canada","GB":"Great Britain", the keys for d
substring "lo" "IN":"India"} - Create a dictionary with keys of
"{} plus {} is {}".format(3,1,4) - Return the "CA", "GB", and "IN" and corresponding values I F STAT E M E N TS A N D LO O P S
string with the values 3, 1, and 4 inserted of of "Canada", "Great Britain", and "India" The body of if statements and loops are defined
s.replace("e","z") - Return a new string based d["GB"] - Return the value from the dictionary d through indentation.
on s with all occurrences of "e" replaced with "z" that has the key "GB" if x > 5:
s.split(" ") - Split the string s into a list of d.get("AU","Sorry") - Return the value from the print("{} is greater than five".format(x))
strings, separating on the character " " and dictionary d that has the key "AU", or the string elif x < 0:
return that list "Sorry" if the key "AU" is not found in d print("{} is negative".format(x))
d.keys() - Return a list of the keys from d else:
NUMERIC TYPES AND d.values() - Return a list of the values from d print("{} is between zero and five".format(x))
M AT H E M AT I C A L O P E R AT I O N S d.items() - Return a list of (key, value) pairs - Test the value of the variable x and run the code
i = int("5") - Convert the string "5" to the from d body based on the value
integer 5 and assign the result to i for value in l:
f = float("2.5") - Convert the string "2.5" to MODULES AND FUNCTIONS print(value)
the float value 2.5 and assign the result to f The body of a function is defined through - Iterate over each value in l, running the code in
5 + 5 - Addition indentation. the body of the loop with each iteration
5 - 5 - Subtraction import random - Import the module random while x < 10:
10 / 2 - Division from math import sqrt - Import the function x += 1
5 * 2 - Multiplication sqrt from the module math - Run the code in the body of the loop until the
value of x is no longer less than 10
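Note that the calculate example in the MODULES AND FUNCTIONS column above mixes its names: the signature uses addition_one/addition_two, the body uses value_one/value_two, and the call is to addition(). A consistent, runnable version of the same idea:

def calculate(value_one, value_two, exponent=1, factor=1):
    # two required arguments plus two optional named arguments
    result = (value_one + value_two) ** exponent * factor
    return result

print(calculate(3, 5, factor=10))   # (3 + 5) ** 1 * 10  ->  80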

LEARN DATA SCIENCE ONLINE
17
Start Learning For Free - www.dataquest.io

Data Science Cheat Sheet


Python - Intermediate

KEY BASICS, PRINTING AND GETTING HELP


This cheat sheet assumes you are familiar with the content of our Python Basics Cheat Sheet

s - A Python string variable l - A Python list variable


i - A Python integer variable d - A Python dictionary variable
f - A Python float variable

L I STS len(my_set) - Returns the number of objects in now - wks4 - Return a datetime object
l.pop(3) - Returns the fourth item from l and my_set (or, the number of unique values from l) representing the time 4 weeks prior to now
deletes it from the list a in my_set - Returns True if the value a exists in newyear_2020 = dt.datetime(year=2020,
l.remove(x) - Removes the first item in l that is my_set month=12, day=31) - Assign a datetime
equal to x object representing December 31, 2020 to
l.reverse() - Reverses the order of the items in l REGULAR EXPRESSIONS newyear_2020
l[1::2] - Returns every second item from l, import re - Import the Regular Expressions module newyear_2020.strftime("%A, %b %d, %Y")
commencing from the 1st item re.search("abc",s) - Returns a match object if - Returns "Thursday, Dec 31, 2020"
l[-5:] - Returns the last 5 items from l specific axis the regex "abc" is found in s, otherwise None dt.datetime.strptime('Dec 31, 2020',"%b
re.sub("abc","xyz",s) - Returns a string where %d, %Y") - Return a datetime object
ST R I N G S all instances matching regex "abc" are replaced representing December 31, 2020
s.lower() - Returns a lowercase version of s by "xyz"
s.title() - Returns s with the first letter of every RANDOM
word capitalized L I ST C O M P R E H E N S I O N import random - Import the random module
"23".zfill(4) - Returns "0023" by left-filling the A one-line expression of a for loop random.random() - Returns a random float
string with 0’s to make its length 4. [i ** 2 for i in range(10)] - Returns a list of
s.splitlines() - Returns a list by splitting the the squares of values from 0 to 9 random.randint(0,10) - Returns a random
string on any newline characters. [s.lower() for s in l_strings] - Returns the integer between 0 and 10
Python strings share some common methods with lists list l_strings, with each item having had the random.choice(l) - Returns a random item from
s[:5] - Returns the first 5 characters of s .lower() method applied the list l
"fri" + "end" - Returns "friend" [i for i in l_floats if i < 0.5] - Returns
"end" in s - Returns True if the substring "end" the items from l_floats that are less than 0.5 COUNTER
is found in s from collections import Counter - Import the
F U N C T I O N S F O R LO O P I N G Counter class
RANGE for i, value in enumerate(l): c = Counter(l) - Assign a Counter (dict-like)
Range objects are useful for creating sequences of print("The value of item {} is {}". object with the counts of each unique item from
integers for looping. format(i,value)) l, to c
range(5) - Returns a sequence from 0 to 4 - Iterate over the list l, printing the index location c.most_common(3) - Return the 3 most common
range(2000,2018) - Returns a sequence from 2000 of each item and its value items from l
to 2017 for one, two in zip(l_one,l_two):
range(0,11,2) - Returns a sequence from 0 to 10, print("one: {}, two: {}".format(one,two)) T RY/ E XC E P T
with each item incrementing by 2 - Iterate over two lists, l_one and l_two and print Catch and deal with Errors
range(0,-10,-1) - Returns a sequence from 0 to -9 each value l_ints = [1, 2, 3, "", 5] - Assign a list of
list(range(5)) - Returns a list from 0 to 4 while x < 10: integers with one missing value to l_ints
x += 1 l_floats = []
DICTIONARIES - Run the code in the body of the loop until the for i in l_ints:
max(d, key=d.get) - Return the key that value of x is no longer less than 10 try:
corresponds to the largest value in d l_floats.append(float(i))
min(d, key=d.get) - Return the key that DAT E T I M E except:
corresponds to the smallest value in d import datetime as dt - Import the datetime l_floats.append(i)
module - Convert each value of l_ints to a float, catching
S E TS now = dt.datetime.now() - Assign datetime and handling ValueError: could not convert
my_set = set(l) - Return a set object containing object representing the current time to now string to float: where values are missing.
the unique values from l wks4 = dt.datetime.timedelta(weeks=4)
- Assign a timedelta object representing a
timespan of 4 weeks to wks4
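A short sketch tying several of the entries above together (list comprehensions, Counter, try/except); the example data is made up:

from collections import Counter

words = ["data", "science", "data", "python", "", "data"]
clean = [w.lower() for w in words if w]      # list comprehension, dropping empty strings
print(Counter(clean).most_common(1))         # [('data', 3)]

raw = ["1.5", "2", "oops", "4.0"]
floats = []
for value in raw:
    try:
        floats.append(float(value))
    except ValueError:                       # catch values that cannot be converted
        floats.append(None)
print(floats)                                # [1.5, 2.0, None, 4.0]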

LEARN DATA SCIENCE ONLINE
18
Start Learning For Free - www.dataquest.io

Data Science Cheat Sheet


NumPy

KEY IMPORTS
We’ll use shorthand in this cheat sheet Import these to start
arr - A numpy Array object import numpy as np

I M P O RT I N G/ E X P O RT I N G arr.T - Transposes arr (rows become columns and S C A L A R M AT H


np.loadtxt('file.txt') - From a text file vice versa) np.add(arr,1) - Add 1 to each array element
np.genfromtxt('file.csv',delimiter=',') arr.reshape(3,4) - Reshapes arr to 3 rows, 4 np.subtract(arr,2) - Subtract 2 from each array
- From a CSV file columns without changing data element
np.savetxt('file.txt',arr,delimiter=' ') arr.resize((5,6)) - Changes arr shape to 5x6 np.multiply(arr,3) - Multiply each array
- Writes to a text file and fills new values with 0 element by 3
np.savetxt('file.csv',arr,delimiter=',') np.divide(arr,4) - Divide each array element by
- Writes to a CSV file A D D I N G/ R E M OV I N G E L E M E N TS 4 (returns np.nan for division by zero)
np.append(arr,values) - Appends values to end np.power(arr,5) - Raise each array element to
C R E AT I N G A R R AYS of arr the 5th power
np.array([1,2,3]) - One dimensional array np.insert(arr,2,values) - Inserts values into
np.array([(1,2,3),(4,5,6)]) - Two dimensional arr before index 2 V E C TO R M AT H
array np.delete(arr,3,axis=0) - Deletes row on index np.add(arr1,arr2) - Elementwise add arr2 to
np.zeros(3) - 1D array of length 3 all values 0 3 of arr arr1
np.ones((3,4)) - 3x4 array with all values 1 np.delete(arr,4,axis=1) - Deletes column on np.subtract(arr1,arr2) - Elementwise subtract
np.eye(5) - 5x5 array of 0 with 1 on diagonal index 4 of arr arr2 from arr1
(Identity matrix) np.multiply(arr1,arr2) - Elementwise multiply
np.linspace(0,100,6) - Array of 6 evenly divided C O M B I N I N G/S P L I T T I N G arr1 by arr2
values from 0 to 100 np.concatenate((arr1,arr2),axis=0) - Adds np.divide(arr1,arr2) - Elementwise divide arr1
np.arange(0,10,3) - Array of values from 0 to less arr2 as rows to the end of arr1 by arr2
than 10 with step 3 (eg [0,3,6,9]) np.concatenate((arr1,arr2),axis=1) - Adds np.power(arr1,arr2) - Elementwise raise arr1
np.full((2,3),8) - 2x3 array with all values 8 arr2 as columns to end of arr1 raised to the power of arr2
np.random.rand(4,5) - 4x5 array of random floats np.split(arr,3) - Splits arr into 3 sub-arrays np.array_equal(arr1,arr2) - Returns True if the
between 0-1 np.hsplit(arr,5) - Splits arr horizontally on the arrays have the same elements and shape
np.random.rand(6,7)*100 - 6x7 array of random 5th index np.sqrt(arr) - Square root of each element in the
floats between 0-100 array
np.random.randint(5,size=(2,3)) - 2x3 array I N D E X I N G/S L I C I N G/S U B S E T T I N G np.sin(arr) - Sine of each element in the array
with random ints between 0-4 arr[5] - Returns the element at index 5 np.log(arr) - Natural log of each element in the
arr[2,5] - Returns the 2D array element on index array
I N S P E C T I N G P R O P E RT I E S [2][5] np.abs(arr) - Absolute value of each element in
arr.size - Returns number of elements in arr arr[1]=4 - Assigns array element on index 1 the the array
arr.shape - Returns dimensions of arr (rows, value 4 np.ceil(arr) - Rounds up to the nearest int
columns) arr[1,3]=10 - Assigns array element on index np.floor(arr) - Rounds down to the nearest int
arr.dtype - Returns type of elements in arr [1][3] the value 10 np.round(arr) - Rounds to the nearest int
arr.astype(dtype) - Convert arr elements to arr[0:3] - Returns the elements at indices 0,1,2
type dtype (On a 2D array: returns rows 0,1,2) STAT I ST I C S
arr.tolist() - Convert arr to a Python list arr[0:3,4] - Returns the elements on rows 0,1,2 np.mean(arr,axis=0) - Returns mean along
np.info(np.eye) - View documentation for at column 4 specific axis
np.eye arr[:2] - Returns the elements at indices 0,1 (On arr.sum() - Returns sum of arr
a 2D array: returns rows 0,1) arr.min() - Returns minimum value of arr
C O P Y I N G/S O RT I N G/ R E S H A P I N G arr[:,1] - Returns the elements at index 1 on all arr.max(axis=0) - Returns maximum value of
np.copy(arr) - Copies arr to new memory rows specific axis
arr.view(dtype) - Creates view of arr elements arr<5 - Returns an array with boolean values np.var(arr) - Returns the variance of array
with type dtype (arr1<3) & (arr2>5) - Returns an array with np.std(arr,axis=1) - Returns the standard
arr.sort() - Sorts arr boolean values deviation of specific axis
arr.sort(axis=0) - Sorts specific axis of arr ~arr - Inverts a boolean array np.corrcoef(arr) - Returns correlation coefficient
two_d_arr.flatten() - Flattens 2D array arr[arr<5] - Returns array elements smaller than 5 of array
two_d_arr to 1D
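A brief sketch of the indexing and statistics entries above (array values chosen arbitrarily):

import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.shape)             # (2, 3)
print(arr[arr < 5])          # [1 2 3 4] -- boolean mask keeps elements below 5
print(arr[:, 1])             # [2 5]     -- all rows, column 1
print(np.mean(arr, axis=0))  # [2.5 3.5 4.5] -- column means
print(arr.sum())             # 21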

LEARN DATA SCIENCE ONLINE
19
Start Learning For Free - www.dataquest.io

Data Science Cheat Sheet


Pandas

KEY IMPORTS
We’ll use shorthand in this cheat sheet Import these to start
df - A pandas DataFrame object import pandas as pd
s - A pandas Series object import numpy as np

I M P O RT I N G DATA SELECTION col1 in ascending order then col2 in descending


pd.read_csv(filename) - From a CSV file df[col] - Returns column with label col as Series order
pd.read_table(filename) - From a delimited text df[[col1, col2]] - Returns Columns as a new df.groupby(col) - Returns a groupby object for
file (like TSV) DataFrame values from one column
pd.read_excel(filename) - From an Excel file s.iloc[0] - Selection by position df.groupby([col1,col2]) - Returns a groupby
pd.read_sql(query, connection_object) - s.loc[0] - Selection by index object values from multiple columns
Reads from a SQL table/database df.iloc[0,:] - First row df.groupby(col1)[col2].mean() - Returns the
pd.read_json(json_string) - Reads from a JSON df.iloc[0,0] - First element of first column mean of the values in col2, grouped by the
formatted string, URL or file. values in col1 (mean can be replaced with
pd.read_html(url) - Parses an html URL, string or DATA C L E A N I N G almost any function from the statistics section)
file and extracts tables to a list of dataframes df.columns = ['a','b','c'] - Renames columns df.pivot_table(index=col1,values=
pd.read_clipboard() - Takes the contents of your pd.isnull() - Checks for null Values, Returns [col2,col3],aggfunc=mean) - Creates a pivot
clipboard and passes it to read_table() Boolean Array table that groups by col1 and calculates the
pd.DataFrame(dict) - From a dict, keys for pd.notnull() - Opposite of s.isnull() mean of col2 and col3
column names, values for data as lists df.dropna() - Drops all rows that contain null
values average across all columns for every unique
E X P O RT I N G DATA df.dropna(axis=1) - Drops all columns that column 1 group
df.to_csv(filename) - Writes to a CSV file contain null values df.apply(np.mean) - Applies a function across
df.to_excel(filename) - Writes to an Excel file df.dropna(axis=1,thresh=n) - Drops all columns that
df.to_sql(table_name, connection_object) - have fewer than n non-null values
Writes to a SQL table df.fillna(x) - Replaces all null values with x across each row
df.to_json(filename) - Writes to a file in JSON s.fillna(s.mean()) - Replaces all null values with
format the mean (mean can be replaced with almost J O I N /C O M B I N E
df.to_html(filename) - Saves as an HTML table any function from the statistics section) df1.append(df2) - Adds the rows in df1 to the
df.to_clipboard() - Writes to the clipboard s.astype(float) - Converts the datatype of the end of df2 (columns should be identical)
series to float pd.concat([df1, df2],axis=1) - Adds the
C R E AT E T E ST O B J E C TS s.replace(1,'one') - Replaces all values equal to columns in df1 to the end of df2 (rows should be
Useful for testing 1 with 'one' identical)
pd.DataFrame(np.random.rand(20,5)) - 5 s.replace([1,3],['one','three']) - Replaces df1.join(df2,on=col1,how='inner') - SQL-style
columns and 20 rows of random floats all 1 with 'one' and 3 with 'three' joins the columns in df1 with the columns
pd.Series(my_list) - Creates a series from an df.rename(columns=lambda x: x + 1) - Mass on df2 where the rows for col have identical
iterable my_list renaming of columns values. how can be one of 'left', 'right',
df.index = pd.date_range('1900/1/30', df.rename(columns={'old_name': 'new_ 'outer', 'inner'
periods=df.shape[0]) - Adds a date index name'}) - Selective renaming
df.set_index('column_one') - Changes the index STAT I ST I C S
V I E W I N G/ I N S P E C T I N G DATA df.rename(index=lambda x: x + 1) - Mass These can all be applied to a series as well.
df.head(n) - First n rows of the DataFrame renaming of index df.describe() - Summary statistics for numerical
df.tail(n) - Last n rows of the DataFrame columns
df.shape() - Number of rows and columns F I LT E R, S O RT, & G R O U P BY df.mean() - Returns the mean of all columns
df.info() - Index, Datatype and Memory df[df[col] > 0.5] - Rows where the col column df.corr() - Returns the correlation between
information is greater than 0.5 columns in a DataFrame
df.describe() - Summary statistics for numerical df[(df[col] > 0.5) & (df[col] < 0.7)] - df.count() - Returns the number of non-null
columns Rows where 0.7 > col > 0.5 values in each DataFrame column
s.value_counts(dropna=False) - Views unique df.sort_values(col1) - Sorts values by col1 in df.max() - Returns the highest value in each
values and counts ascending order column
df.apply(pd.Series.value_counts) - Unique df.sort_values(col2,ascending=False) - Sorts df.min() - Returns the lowest value in each column
values and counts for all columns values by col2 in descending order df.median() - Returns the median of each column
df.sort_values([col1,col2], df.std() - Returns the standard deviation of each
ascending=[True,False]) - Sorts values by column
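A small sketch of the grouping and pivoting entries above, using a made-up DataFrame:

import pandas as pd

df = pd.DataFrame({"team":  ["a", "a", "b", "b"],
                   "year":  [2020, 2021, 2020, 2021],
                   "sales": [10, 12, 7, 9]})

print(df.groupby("team")["sales"].mean())          # a: 11.0, b: 8.0
print(df.pivot_table(index="team", columns="year",
                     values="sales", aggfunc="mean"))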

LEARN DATA SCIENCE ONLINE
20
Start Learning For Free - www.dataquest.io

Data Science Cheat Sheet


Python Regular Expressions

S P E C I A L C H A R AC T E R S \A | Matches the expression to its right at the (?:A) | Matches the expression as represented
^ | Matches the expression to its right at the absolute start of a string whether in single by A, but unlike (?PAB), it cannot be
start of a string. It matches every such by A, but unlike (?P<name>AB), it cannot be
instance before each \n in the string. \Z | Matches the expression to its left at the (?#...) | A comment. Contents are for us to
$ | Matches the expression to its left at the absolute end of a string whether in single read, not for matching.
end of a string. It matches every such or multi-line mode. A(?=B) | Lookahead assertion. This matches
instance before each \n in the string. the expression A only if it is followed by B.
. | Matches any character except line A(?!B) | Negative lookahead assertion. This
terminators like \n. S E TS matches the expression A only if it is not
\ | Escapes special characters or denotes [ ] | Contains a set of characters to match. followed by B.
character classes. [amk] | Matches either a, m, or k. It does not (?<=B)A | Positive lookbehind assertion.
A|B | Matches expression A or B. If A is match amk. This matches the expression A only if B
matched first, B is left untried. [a-z] | Matches any alphabet from a to z. is immediately to its left. This can only
+ | Greedily matches the expression to its left 1 [a\-z] | Matches a, -, or z. It matches - matched fixed length expressions.
or more times. because \ escapes it. (?<!B)A | Negative lookbehind assertion.
* | Greedily matches the expression to its left [a-] | Matches a or -, because - is not being This matches the expression A only if B is
0 or more times. used to indicate a series of characters. not immediately to its left. This can only
? | Greedily matches the expression to its left [-a] | As above, matches a or -. matched fixed length expressions.
0 or 1 times. But if ? is added to qualifiers [a-z0-9] | Matches characters from a to z (?P=name) | Matches the expression matched
(+, *, and ? itself) it will perform matches in and also from 0 to 9. by an earlier group named “name”.
a non-greedy manner. [(+*)] | Special characters become literal (...)\1 | The number 1 corresponds to
{m} | Matches the expression to its left m inside a set, so this matches (, +, *, and ). the first group to be matched. If we want
times, and not less. [^ab5] | Adding ^ excludes any character in to match more instances of the same
{m,n} | Matches the expression to its left m to the set. Here, it matches characters that are expression, simply use its number instead of
n times, and not less. not a, b, or 5. writing out the whole expression again. We
{m,n}? | Matches the expression to its left m can use from 1 up to 99 such groups and
times, and ignores n. See ? above. their corresponding numbers.
GROUPS
( ) | Matches the expression inside the
C H A R AC T E R C L AS S E S parentheses and groups it. POPULAR PYTHON RE MODULE
( A. K.A. S P E C I A L S E Q U E N C E S) (?) | Inside parentheses like this, ? acts as an FUNCTIONS
\w | Matches alphanumeric characters, which extension notation. Its meaning depends on re.findall(A, B) | Matches all instances
means a-z, A-Z, and 0-9. It also matches the character immediately to its right. of an expression A in a string B and returns
the underscore, _. (?P<name>AB) | Matches the expression AB, and it them in a list.
\d | Matches digits, which means 0-9. can be accessed with the group name. re.search(A, B) | Matches the first instance
\D | Matches any non-digits. (?aiLmsux) | Here, a, i, L, m, s, u, and x are of an expression A in a string B, and returns
\s | Matches whitespace characters, which flags: it as a re match object.
include the \t, \n, \r, and space characters. a — Matches ASCII only re.split(A, B) | Split a string B into a list
\S | Matches non-whitespace characters. i — Ignore case using the delimiter A.
\b | Matches the boundary (or empty string) L — Locale dependent re.sub(A, B, C) | Replace A with B in the
at the start and end of a word, that is, m — Multi-line string C.
between \w and \W. s — Matches all
\B | Matches where \b does not, that is, the u — Matches unicode
boundary of \w characters. x — Verbose
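A compact sketch of the re functions listed in the last column (the example string is made up):

import re

text = "Order 66 shipped on 2020-12-31, order 67 pending."

print(re.findall(r"\d{4}-\d{2}-\d{2}", text))                    # ['2020-12-31']
print(re.search(r"order (\d+)", text, re.IGNORECASE).group(1))   # '66'
print(re.sub(r"\d+", "#", text))                                 # every run of digits replaced by '#'
print(re.split(r",\s*", text))                                   # split on commas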

Python For Data Science Cheat Sheet Excel Spreadsheets Pickled Files
>>> file = 'urbanpop.xlsx' >>> import pickle
Importing Data >>> data = pd.ExcelFile(file) >>> with open('pickled_fruit.pkl', 'rb') as file:
>>> df_sheet2 = data.parse('1960-1966', pickled_data = pickle.load(file)
Learn Python for data science Interactively at www.DataCamp.com skiprows=[0],
names=['Country',
'AAM: War(2002)'])
>>> df_sheet1 = data.parse(0, HDF5 Files
parse_cols=[0],
Importing Data in Python skiprows=[0], >>> import h5py
names=['Country']) >>> filename = 'H-H1_LOSC_4_v1-815411200-4096.hdf5'
Most of the time, you’ll use either NumPy or pandas to import >>> data = h5py.File(filename, 'r')
your data: To access the sheet names, use the sheet_names attribute:
>>> import numpy as np >>> data.sheet_names
>>> import pandas as pd Matlab Files
SAS Files >>> import scipy.io
Help >>> filename = 'workspace.mat'
>>> from sas7bdat import SAS7BDAT >>> mat = scipy.io.loadmat(filename)
>>> np.info(np.ndarray.dtype)
>>> help(pd.read_csv) >>> with SAS7BDAT('urbanpop.sas7bdat') as file:
df_sas = file.to_data_frame()

Text Files Exploring Dictionaries


Stata Files Accessing Elements with Functions
Plain Text Files >>> data = pd.read_stata('urbanpop.dta') >>> print(mat.keys()) Print dictionary keys
>>> filename = 'huck_finn.txt' >>> for key in data.keys(): Print dictionary keys
>>> file = open(filename, mode='r') Open the file for reading print(key)
meta
>>> text = file.read() Read a file’s contents Relational Databases quality
>>> print(file.closed) Check whether file is closed
>>> from sqlalchemy import create_engine strain
>>> file.close() Close file
>>> print(text) >>> engine = create_engine('sqlite://Northwind.sqlite') >>> pickled_data.values() Return dictionary values
>>> print(mat.items()) Returns items in list format of (key, value)
e Use the table_names() method to fetch a list of table names: tuple pairs
Using the context manager with
x >>> table_names = engine.table_names() Accessing Data Items with Keys
>>> with open('huck_finn.txt',
'r') as file:
print(file.readline()) Read a single line
print(file.readline()) Querying Relational Databases >>> for key in data ['meta'].keys() Explore the HDF5 structure
print(file.readline()) print(key)
>>> con = engine.connect() Description
>>> rs = con.execute("SELECT * FROM Orders") DescriptionURL
Table Data: Flat Files >>> df = pd.DataFrame(rs.fetchall()) Detector
>>> df.columns = rs.keys() Duration
>>> con.close() GPSstart
Importing Flat Files with numpy Observatory
Using the context manager with Type
Files with one data type UTCstart
>>> filename = 'mnist.txt' >>> with engine.connect() as con:
>>> print(data['meta']['Description'].value) Retrieve the value for a key
>>> data = np.loadtxt(filename, rs = con.execute("SELECT OrderID FROM Orders")
delimiter=',', String used to separate values df = pd.DataFrame(rs.fetchmany(size=5))
skiprows=2, Skip the first 2 lines df.columns = rs.keys()
usecols=[0,2], Read the 1st and 3rd column
Navigating Your FileSystem
dtype=str) The type of the resulting array Querying relational databases with pandas
Magic Commands
Files with mixed data types >>> df = pd.read_sql_query("SELECT * FROM Orders", engine)
>>> filename = 'titanic.csv' !ls List directory contents of files and directories
>>> data = np.genfromtxt(filename, %cd .. Change current working directory
delimiter=',', %pwd Return the current working directory path
names=True, Look for column header
Exploring Your Data
dtype=None)
NumPy Arrays os Library
>>> data_array = np.recfromcsv(filename) >>> data_array.dtype Data type of array elements >>> import os
>>> data_array.shape Array dimensions >>> path = "/usr/tmp"
The default dtype of the np.recfromcsv() function is None. >>> wd = os.getcwd() Store the name of current directory in a string
>>> len(data_array) Length of array
>>> os.listdir(wd) Output contents of the directory in a list
Importing Flat Files with pandas >>> os.chdir(path) Change current working directory
pandas DataFrames >>> os.rename("test1.txt", Rename a file
>>> filename = 'winequality-red.csv' "test2.txt")
>>> data = pd.read_csv(filename, >>> df.head() Return first DataFrame rows
nrows=5, >>> os.remove("test1.txt") Delete an existing file
Number of rows of file to read >>> df.tail() Return last DataFrame rows >>> os.mkdir("newdir") Create a new directory
header=None, Row number to use as col names >>> df.index Describe index
sep='\t', Delimiter to use >>> df.columns Describe DataFrame columns
comment='#', Character to split comments >>> df.info() Info on DataFrame
21

na_values=[""]) String to recognize as NA/NaN >>> data_array = data.values Convert a DataFrame to a NumPy array DataCamp
Learn Python for Data Science Interactively
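A brief sketch combining a few of the import patterns above; the file names reuse the placeholders that already appear on this sheet:

import numpy as np
import pandas as pd

# Flat file with NumPy (column names taken from the header row)
data = np.genfromtxt("my_file.csv", delimiter=",", names=True, dtype=None, encoding=None)

# Flat file with pandas, treating '#' lines as comments and empty strings as NaN
df = pd.read_csv("my_file.csv", sep=",", comment="#", na_values=[""])

# Plain text with a context manager, so the file is closed automatically
with open("huck_finn.txt", "r") as file:
    first_line = file.readline()

print(df.head())
print(first_line)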
Lists Also see NumPy Arrays Libraries
Python For Data Science Cheat Sheet >>> a = 'is' Import libraries
Python Basics >>> b = 'nice' >>> import numpy Data analysis Machine learning
Learn More Python for Data Science Interactively at www.datacamp.com >>> my_list = ['my', 'list', a, b] >>> import numpy as np
>>> my_list2 = [[4,5,6,7], [3,4,5,6]] Selective import
>>> from math import pi Scientific computing 2D plotting
Variables and Data Types Selecting List Elements Index starts at 0
Subset Install Python
Variable Assignment
>>> my_list[1] Select item at index 1
>>> x=5 Select 3rd last item
>>> my_list[-3]
>>> x
Slice
5 >>> my_list[1:3] Select items at index 1 and 2
Calculations With Variables >>> my_list[1:] Select items a!er index 0
>>> my_list[:3] Select items before index 3 Leading open data science platform Free IDE that is included Create and share
>>> x+2 Sum of two variables powered by Python with Anaconda documents with live code,
>>> my_list[:] Copy my_list
7 visualizations, text, ...
>>> x-2 Subtraction of two variables
Subset Lists of Lists
>>> my_list2[1][0] my_list[list][itemOfList]
3 Numpy Arrays Also see Lists
>>> my_list2[1][:2]
>>> x*2 Multiplication of two variables
>>> my_list = [1, 2, 3, 4]
10 List Operations
>>> x**2 Exponentiation of a variable >>> my_array = np.array(my_list)
25 >>> my_list + my_list >>> my_2darray = np.array([[1,2,3],[4,5,6]])
>>> x%2 Remainder of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']
Selecting Numpy Array Elements Index starts at 0
1 >>> my_list * 2
>>> x/float(2) Division of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice'] Subset
2.5 >>> my_list2 > 4 >>> my_array[1] Select item at index 1
True 2
Types and Type Conversion Slice
List Methods >>> my_array[0:2] Select items at index 0 and 1
str() '5', '3.45', 'True' Variables to strings
my_list.index(a) Get the index of an item array([1, 2])
>>>
int() 5, 3, 1 Variables to integers >>> my_list.count(a) Count an item Subset 2D Numpy arrays
>>> my_list.append('!') Append an item at a time >>> my_2darray[:,0] my_2darray[rows, columns]
my_list.remove('!') Remove an item array([1, 4])
float() 5.0, 1.0 Variables to floats >>>
>>> del(my_list[0:1]) Remove an item Numpy Array Operations
bool() True, True, True Variables to booleans >>> my_list.reverse() Reverse the list
>>> my_array > 3
>>> my_list.extend('!') Append an item array([False, False, False, True], dtype=bool)
>>> my_list.pop(-1) Remove an item >>> my_array * 2
Asking For Help >>> my_list.insert(0,'!') Insert an item array([2, 4, 6, 8])
>>> help(str) >>> my_list.sort() Sort the list >>> my_array + np.array([5, 6, 7, 8])
array([6, 8, 10, 12])
Strings
>>> my_string = 'thisStringIsAwesome' Numpy Array Functions
String Operations Index starts at 0
>>> my_string >>> my_array.shape Get the dimensions of the array
'thisStringIsAwesome' >>> my_string[3] >>> np.append(other_array) Append items to an array
>>> my_string[4:9] >>> np.insert(my_array, 1, 5) Insert items in an array
String Operations >>> np.delete(my_array,[1]) Delete items in an array
String Methods >>> np.mean(my_array) Mean of the array
>>> my_string * 2
'thisStringIsAwesomethisStringIsAwesome' >>> my_string.upper() String to uppercase >>> np.median(my_array) Median of the array
>>> my_string + 'Innit' >>> my_string.lower() String to lowercase >>> np.corrcoef(my_array) Correlation coefficient
'thisStringIsAwesomeInnit' >>> my_string.count('w') Count String elements >>> np.std(my_array) Standard deviation
>>> 'm' in my_string >>> my_string.replace('e', 'i') Replace String elements
22

True >>> my_string.strip() Strip whitespaces DataCamp


Learn Python for Data Science Interactively
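The list and NumPy array sections above react differently to the same operators; a two-line sketch of the contrast:

import numpy as np

my_list = [1, 2, 3, 4]
my_array = np.array(my_list)

print(my_list * 2)    # [1, 2, 3, 4, 1, 2, 3, 4]  -- lists are repeated
print(my_array * 2)   # [2 4 6 8]                 -- arrays multiply element-wise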
Inspecting Your Array Subsetting, Slicing, Indexing Also see Lists
Python For Data Science Cheat Sheet >>> a.shape Array dimensions
>>> len(a) Length of array
Subsetting
NumPy Basics >>> a[2] 1 2 3 Select the element at the 2nd index
>>> b.ndim Number of array dimensions 3
Learn Python for Data Science Interactively at www.DataCamp.com >>> e.size Number of array elements 1.5 2 3
>>> b[1,2] Select the element at row 1 column 2
>>> b.dtype Data type of array elements 6.0 4 5 6 (equivalent to b[1][2])
>>> b.dtype.name Name of data type
>>> b.astype(int) Convert an array to a different type Slicing
>>> a[0:2] 1 2 3 Select items at index 0 and 1
NumPy 2 array([1, 2])
Asking For Help >>> b[0:2,1] 1.5 2 3 Select items at rows 0 and 1 in column 1
The NumPy library is the core library for scientific computing in >>> np.info(np.ndarray.dtype) array([ 2., 5.]) 4 5 6
Python. It provides a high-performance multidimensional array
1.5 2 3
>>> b[:1] Select all items at row 0
object, and tools for working with these arrays. Array Mathematics array([[1.5, 2., 3.]]) 4 5 6 (equivalent to b[0:1, :])
Arithmetic Operations >>> c[1,...] Same as [1,:,:]
Use the following import convention: array([[[ 3., 2., 1.],
>>> import numpy as np [ 4., 5., 6.]]])
>>> g = a - b Subtraction
array([[-0.5, 0. , 0. ], >>> a[ : :-1] Reversed array a
NumPy Arrays array([3, 2, 1])
[-3. , -3. , -3. ]])
>>> np.subtract(a,b) Subtraction Boolean Indexing
1D array 2D array 3D array >>> a[a<2] 1 2 3 Select elements from a less than 2
>>> b + a Addition
array([[ 2.5, 4. , 6. ], array([1])
axis 1 axis 2
1 2 3 axis 1 [ 5. , 7. , 9. ]]) Fancy Indexing
1.5 2 3 >>> np.add(b,a) Addition >>> b[[1, 0, 1, 0],[0, 1, 2, 0]] Select elements (1,0),(0,1),(1,2) and (0,0)
axis 0 axis 0 array([ 4. , 2. , 6. , 1.5])
4 5 6 >>> a / b Division
array([[ 0.66666667, 1. , 1. ], >>> b[[1, 0, 1, 0]][:,[0,1,2,0]] Select a subset of the matrix’s rows
[ 0.25 , 0.4 , 0.5 ]]) array([[ 4. ,5. , 6. , 4. ], and columns
>>> np.divide(a,b) Division [ 1.5, 2. , 3. , 1.5],
[ 4. , 5. , 6. , 4. ],
Creating Arrays >>> a * b Multiplication [ 1.5, 2. , 3. , 1.5]])
array([[ 1.5, 4. , 9. ],
>>> a = np.array([1,2,3]) [ 4. , 10. , 18. ]])
>>> b = np.array([(1.5,2,3), (4,5,6)], dtype = float) >>> np.multiply(a,b) Multiplication Array Manipulation
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]], >>> np.exp(b) Exponentiation
dtype = float) >>> np.sqrt(b) Square root Transposing Array
>>> np.sin(a) Print sines of an array >>> i = np.transpose(b) Permute array dimensions
Initial Placeholders >>> np.cos(b) Element-wise cosine >>> i.T Permute array dimensions
>>> np.log(a) Element-wise natural logarithm
>>> np.zeros((3,4)) Create an array of zeros >>> e.dot(f) Dot product
Changing Array Shape
>>> np.ones((2,3,4),dtype=np.int16) Create an array of ones >>> b.ravel() Flatten the array
>>> d = np.arange(10,25,5) Create an array of evenly [ 7., 7.]]) >>> g.reshape(3,-2) Reshape, but don’t change data
spaced values (step value)
>>> np.linspace(0,2,9) Create an array of evenly Comparison Adding/Removing Elements
spaced values (number of samples) >>> h.resize((2,6)) Return a new array with shape (2,6)
>>> e = np.full((2,2),7) Create a constant array >>> a == b Element-wise comparison >>> np.append(h,g) Append items to an array
>>> f = np.eye(2) Create a 2X2 identity matrix array([[False, True, True], >>> np.insert(a, 1, 5) Insert items in an array
>>> np.random.random((2,2)) Create an array with random values [False, False, False]], dtype=bool) >>> np.delete(a,[1]) Delete items from an array
>>> np.empty((3,2)) Create an empty array >>> a < 2 Element-wise comparison
array([True, False, False], dtype=bool) Combining Arrays
>>> np.array_equal(a, b) Array-wise comparison >>> np.concatenate((a,d),axis=0) Concatenate arrays
array([ 1, 2, 3, 10, 15, 20])
I/O >>> np.vstack((a,b)) Stack arrays vertically (row-wise)
Aggregate Functions array([[ 1. , 2. , 3. ],
Saving & Loading On Disk [ 1.5, 2. , 3. ],
>>> a.sum() Array-wise sum [ 4. , 5. , 6. ]])
>>> np.save('my_array', a) >>> a.min() Array-wise minimum value >>> np.r_[e,f] Stack arrays vertically (row-wise)
>>> np.savez('array.npz', a, b) >>> b.max(axis=0) Maximum value of an array row >>> np.hstack((e,f)) Stack arrays horizontally (column-wise)
>>> np.load('my_array.npy') >>> b.cumsum(axis=1) Cumulative sum of the elements array([[ 7., 7., 1., 0.],
>>> a.mean() Mean [ 7., 7., 0., 1.]])
Saving & Loading Text Files >>> np.median(b) Median >>> np.column_stack((a,d)) Create stacked column-wise arrays
>>> np.corrcoef(a) Correlation coefficient array([[ 1, 10],
>>> np.loadtxt("myfile.txt") [ 2, 15],
>>> np.genfromtxt("my_file.csv", delimiter=',') >>> np.std(b) Standard deviation [ 3, 20]])
>>> np.savetxt("myarray.txt", a, delimiter=" ") >>> np.c_[a,d] Create stacked column-wise arrays
Copying Arrays Splitting Arrays
Data Types >>> np.hsplit(a,3) Split the array horizontally at the 3rd
>>> h = a.view() Create a view of the array with the same data
[array([1]),array([2]),array([3])] index
>>> np.int64 Signed 64-bit integer types >>> np.copy(a) Create a copy of the array
>>> h = a.copy() Create a deep copy of the array >>> np.vsplit(c,2) Split the array vertically at the 2nd index
>>> np.float32 Standard double-precision floating point [array([[[ 1.5, 2. , 1. ],
>>> np.complex Complex numbers represented by 128 floats [ 4. , 5. , 6. ]]]),
array([[[ 3., 2., 3.],
>>> np.bool Boolean type storing TRUE and FALSE values [ 4., 5., 6.]]])]
>>> np.object Python object type Sorting Arrays
>>> np.string_ Fixed-length string type >>> a.sort() Sort an array
23

>>> np.unicode_ Fixed-length unicode type >>> c.sort(axis=0) Sort the elements of an array's axis DataCamp
Learn Python for Data Science Interactively
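A short sketch of the array-manipulation entries above, reusing the a and b arrays defined on this sheet:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([(1.5, 2, 3), (4, 5, 6)], dtype=float)

print(b + a)                        # broadcasting: a is added to every row of b
print(b.ravel())                    # flatten to 1-D: [1.5 2.  3.  4.  5.  6. ]
print(b.reshape(3, 2))              # same data, new shape (3, 2)
print(np.vstack((a, b)))            # stack row-wise -> shape (3, 3)
print(np.hsplit(np.arange(6), 3))   # split into 3 equal pieces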
Python For Data Science Cheat Sheet Asking For Help Dropping
>>> help(pd.Series.loc)
>>> s.drop(['a', 'c']) Drop values from rows (axis=0)
Pandas Basics Selection Also see NumPy Arrays >>> df.drop('Country', axis=1) Drop values from columns(axis=1)
Learn Python for Data Science Interactively at www.DataCamp.com
Getting
>>> s['b'] Get one element Sort & Rank
-5
Pandas >>> df.sort_index() Sort by labels along an axis
>>> df[1:] Get subset of a DataFrame >>> df.sort_values(by='Country') Sort by the values along an axis
The Pandas library is built on NumPy and provides easy-to-use Country Capital Population >>> df.rank() Assign ranks to entries
data structures and data analysis tools for the Python 1 India New Delhi 1303171035
2 Brazil Brasília 207847528
programming language. Retrieving Series/DataFrame Information
Selecting, Boolean Indexing & Setting Basic Information
Use the following import convention: By Position >>> df.shape (rows,columns)
>>> import pandas as pd >>> df.iloc[[0],[0]] Select single value by row & >>> df.index Describe index
'Belgium' column >>> df.columns Describe DataFrame columns
Pandas Data Structures >>> df.info() Info on DataFrame
>>> df.iat[0,0] >>> df.count() Number of non-NA values
Series 'Belgium'
Summary
A one-dimensional labeled array a 3 By Label
>>> df.loc[[0], ['Country']] Select single value by row & >>> df.sum() Sum of values
capable of holding any data type b -5 >>> df.cumsum() Cummulative sum of values
'Belgium' column labels >>> df.min()/df.max() Minimum/maximum values
c 7 >>> df.at[0, 'Country']
Index
d 4 'Belgium' >>> df.describe() Summary statistics
>>> df.mean() Mean of values
By Label/Position >>> df.median() Median of values
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
>>> df.ix[2] Select single row of
DataFrame Country Brazil subset of rows
Capital Brasília
Applying Functions
Population 207847528 >>> f = lambda x: x*2
Columns
Country Capital Population A two-dimensional labeled >>> df.ix[:,'Capital'] Select a single column of >>> df.apply(f) Apply function
>>> df.applymap(f) Apply function element-wise
data structure with columns 0 Brussels subset of columns
0 Belgium Brussels 11190846 1 New Delhi
of potentially different types 2 Brasília Data Alignment
1 India New Delhi 1303171035
Index >>> df.ix[1,'Capital'] Select rows and columns
2 Brazil Brasília 207847528 Internal Data Alignment
'New Delhi'
NA values are introduced in the indices that don’t overlap:
Boolean Indexing
>>> data = {'Country': ['Belgium', 'India', 'Brazil'], >>> s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
>>> s[~(s > 1)] Series s where value is not >1
'Capital': ['Brussels', 'New Delhi', 'Brasília'], >>> s[(s < -1) | (s > 2)] s where value is <-1 or >2 >>> s + s3
'Population': [11190846, 1303171035, 207847528]} >>> df[df['Population']>1200000000] Use filter to adjust DataFrame a 10.0
b NaN
>>> df = pd.DataFrame(data, Setting
c 5.0
columns=['Country', 'Capital', 'Population']) >>> s['a'] = 6 Set index a of Series s to 6
d 7.0

I/O Arithmetic Operations with Fill Methods


You can also do the internal data alignment yourself with
Read and Write to CSV Read and Write to SQL Query or Database Table
the help of the fill methods:
>>> pd.read_csv('file.csv', header=None, nrows=5) >>> from sqlalchemy import create_engine >>> s.add(s3, fill_value=0)
>>> df.to_csv('myDataFrame.csv') >>> engine = create_engine('sqlite:///:memory:') a 10.0
>>> pd.read_sql("SELECT * FROM my_table;", engine) b -5.0
Read and Write to Excel c 5.0
>>> pd.read_sql_table('my_table', engine) d 7.0
>>> pd.read_excel('file.xlsx') >>> pd.read_sql_query("SELECT * FROM my_table;", engine) >>> s.sub(s3, fill_value=2)
>>> pd.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1') >>> s.div(s3, fill_value=4)
read_sql() is a convenience wrapper around read_sql_table() and
Read multiple sheets from the same file >>> s.mul(s3, fill_value=3)
read_sql_query()
>>> xlsx = pd.ExcelFile('file.xls')
24

>>> df = pd.read_excel(xlsx, 'Sheet1') >>> pd.to_sql('myDf', engine) DataCamp


Learn Python for Data Science Interactively
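A minimal sketch of the selection idioms above, reusing the sheet's Country/Capital/Population example:

import pandas as pd

data = {"Country": ["Belgium", "India", "Brazil"],
        "Capital": ["Brussels", "New Delhi", "Brasília"],
        "Population": [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=["Country", "Capital", "Population"])

print(df.iloc[0, 0])                        # 'Belgium' -- by position
print(df.loc[0, "Country"])                 # 'Belgium' -- by label
print(df[df["Population"] > 1200000000])    # boolean filter keeps India only
print(df.sort_values(by="Country"))         # sort by the values of one column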
Advanced Indexing Also see NumPy Arrays Combining Data
Python For Data Science Cheat Sheet data1 data2
Selecting
Pandas >>> df3.loc[:,(df3>1).any()] Select cols with any vals >1 X1 X2 X1 X3
Learn Python for Data Science Interactively at www.DataCamp.com >>> df3.loc[:,(df3>1).all()] Select cols with vals > 1
>>> df3.loc[:,df3.isnull().any()] Select cols with NaN a 11.432 a 20.784
>>> df3.loc[:,df3.notnull().all()] Select cols without NaN b 1.303 b NaN
Indexing With isin c 99.906 d 20.784
>>> df[(df.Country.isin(df2.Type))] Find same elements
Reshaping Data >>> df3.filter(items=”a”,”b”]) Filter on values
>>> df.select(lambda x: not x%5) Select specific elements Merge
Pivot Where X1 X2 X3
>>> pd.merge(data1,
>>> df3= df2.pivot(index='Date', Spread rows into columns >>> s.where(s > 0) Subset the data data2, a 11.432 20.784
columns='Type', Query how='left',
values='Value') b 1.303 NaN
>>> df6.query('second > first') Query DataFrame on='X1')
c 99.906 NaN
Date Type Value

0 2016-03-01 a 11.432 Type a b c Setting/Resetting Index >>> pd.merge(data1, X1 X2 X3


1 2016-03-02 b 13.031 Date data2, a 11.432 20.784
>>> df.set_index('Country') Set the index
how='right',
2 2016-03-01 c 20.784 2016-03-01 11.432 NaN 20.784 >>> df4 = df.reset_index() Reset the index b 1.303 NaN
on='X1')
3 2016-03-03 a 99.906 >>> df = df.rename(index=str, Rename DataFrame d NaN 20.784
2016-03-02 1.303 13.031 NaN columns={"Country":"cntry",
4 2016-03-02 a 1.303 "Capital":"cptl", >>> pd.merge(data1,
2016-03-03 99.906 NaN 20.784 "Population":"ppltn"}) X1 X2 X3
5 2016-03-03 c 20.784 data2,
how='inner', a 11.432 20.784
Pivot Table Reindexing on='X1') b 1.303 NaN
>>> s2 = s.reindex(['a','c','d','e','b']) X1 X2 X3
>>> df4 = pd.pivot_table(df2, Spread rows into columns
values='Value', Forward Filling Backward Filling >>> pd.merge(data1,
index='Date', data2, a 11.432 20.784
columns='Type') >>> df.reindex(range(4), >>> s3 = s.reindex(range(5),
how='outer',
method='ffill') method='bfill') on='X1') c 99.906 NaN
Stack / Unstack Country Capital Population 0 3
0 Belgium Brussels 11190846 1 3 d NaN 20.784
>>> stacked = df5.stack() Pivot a level of column labels 1 India New Delhi 1303171035 2 3
>>> stacked.unstack() Pivot a level of index labels 2 Brazil Brasília 207847528 3 3 Join
3 Brazil Brasília 207847528 4 3
0 1 1 5 0 0.233482 >>> data1.join(data2, how='right')
1 5 0.233482 0.390959 1 0.390959 MultiIndexing Concatenate
2 4 0.184713 0.237102 2 4 0 0.184713
>>> arrays = [np.array([1,2,3]),
3 3 0.433522 0.429401 1 0.237102 np.array([5,4,3])] Vertical
>>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays) >>> s.append(s2)
Unstacked 3 3 0 0.433522
>>> tuples = list(zip(*arrays)) Horizontal/Vertical
1 0.429401 >>> index = pd.MultiIndex.from_tuples(tuples, >>> pd.concat([s,s2],axis=1, keys=['One','Two'])
Stacked names=['first', 'second']) >>> pd.concat([data1, data2], axis=1, join='inner')
>>> df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
Melt >>> df2.set_index(["Date", "Type"])
>>> pd.melt(df2, Gather columns into rows
Dates
id_vars=["Date"], >>> df2['Date']= pd.to_datetime(df2['Date'])
value_vars=["Type", "Value"],
Duplicate Data >>> df2['Date']= pd.date_range('2000-1-1',
value_name="Observations") >>> s3.unique() Return unique values periods=6,
>>> df2.duplicated('Type') Check duplicates freq='M')
Date Variable Observations >>> dates = [datetime(2012,5,1), datetime(2012,5,2)]
Date Type Value >>> df2.drop_duplicates('Type', keep='last') Drop duplicates
0 2016-03-01 Type a >>> index = pd.DatetimeIndex(dates)
0 2016-03-01 a 11.432 1 2016-03-02 Type b
>>> df.index.duplicated() Check index duplicates >>> index = pd.date_range(datetime(2012,2,1), end, freq='BM')
1 2016-03-02 b 13.031 2 2016-03-01 Type c
2 2016-03-01 c 20.784 3 2016-03-03 Type a Grouping Data Also see Matplotlib
4 2016-03-02 Type a
Visualization
3 2016-03-03 a 99.906
5 2016-03-03 Type c Aggregation >>> import matplotlib.pyplot as plt
4 2016-03-02 a 1.303 >>> df2.groupby(by=['Date','Type']).mean()
6 2016-03-01 Value 11.432 >>> s.plot() >>> df2.plot()
>>> df4.groupby(level=0).sum()
5 2016-03-03 c 20.784 7 2016-03-02 Value 13.031 >>> df4.groupby(level=0).agg({'a':lambda x:sum(x)/len(x), >>> plt.show() >>> plt.show()
8 2016-03-01 Value 20.784 'b': np.sum})
9 2016-03-03 Value 99.906 Transformation
>>> customSum = lambda x: (x+x%2)
10 2016-03-02 Value 1.303
>>> df4.groupby(level=0).transform(customSum)
11 2016-03-03 Value 20.784

Iteration Missing Data


>>> df.dropna() Drop NaN values
>>> df.iteritems() (Column-index, Series) pairs >>> df3.fillna(df3.mean()) Fill NaN values with a predetermined value
25

>>> df.iterrows() (Row-index, Series) pairs >>> df2.replace("a", "f") Replace values with others
DataCamp
Learn Python for Data Science Interactively
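A short sketch of the combining operations above, mirroring the data1/data2 example:

import pandas as pd

data1 = pd.DataFrame({"X1": ["a", "b", "c"], "X2": [11.432, 1.303, 99.906]})
data2 = pd.DataFrame({"X1": ["a", "b", "d"], "X3": [20.784, None, 20.784]})

print(pd.merge(data1, data2, how="left", on="X1"))    # keep every row of data1
print(pd.merge(data1, data2, how="inner", on="X1"))   # only keys present in both
print(pd.concat([data1, data2], axis=0, sort=False))  # stack vertically, union of columns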
Python For Data Science Cheat Sheet Plot Anatomy & Workflow
Plot Anatomy Workflow
Matplotlib Axes/Subplot The basic steps to creating plots with matplotlib are:
Learn Python Interactively at www.DataCamp.com 1 Prepare data 2 Create plot 3 Plot 4 Customize plot 5 Save plot 6 Show plot
>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() Step 2
Matplotlib Y-axis Figure >>> ax = fig.add_subplot(111) Step 3
>>> ax.plot(x, y, color='lightblue', linewidth=3) Step 3, 4
Matplotlib is a Python 2D plotting library which produces >>> ax.scatter([2,4,6],
publication-quality figures in a variety of hardcopy formats [5,15,25],
color='darkgreen',
and interactive environments across marker='^')
platforms. >>> ax.set_xlim(1, 6.5)
X-axis
>>> plt.savefig('foo.png')
>>> plt.show() Step 6
1 Prepare The Data                                    Also see Lists & NumPy
1D Data
>>> import numpy as np
>>> x = np.linspace(0, 10, 100)
>>> y = np.cos(x)
>>> z = np.sin(x)
2D Data or Images
>>> data = 2 * np.random.random((10, 10))
>>> data2 = 3 * np.random.random((10, 10))
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j]
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2
>>> from matplotlib.cbook import get_sample_data
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy'))

2 Create Plot
>>> import matplotlib.pyplot as plt
Figure
>>> fig = plt.figure()
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0))
Axes
All plotting is done with respect to an Axes. In most cases, a subplot will fit
your needs. A subplot is an axes on a grid system.
>>> fig.add_axes()
>>> ax1 = fig.add_subplot(221)                        # row-col-num
>>> ax3 = fig.add_subplot(212)
>>> fig3, axes = plt.subplots(nrows=2, ncols=2)
>>> fig4, axes2 = plt.subplots(ncols=3)

4 Customize Plot
Colors, Color Bars & Color Maps
>>> plt.plot(x, x, x, x**2, x, x**3)
>>> ax.plot(x, y, alpha=0.4)
>>> ax.plot(x, y, c='k')
>>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img, cmap='seismic')
Markers
>>> fig, ax = plt.subplots()
>>> ax.scatter(x, y, marker=".")
>>> ax.plot(x, y, marker="o")
Linestyles
>>> plt.plot(x, y, linewidth=4.0)
>>> plt.plot(x, y, ls='solid')
>>> plt.plot(x, y, ls='--')
>>> plt.plot(x, y, '--', x**2, y**2, '-.')
>>> plt.setp(lines, color='r', linewidth=4.0)
Text & Annotations
>>> ax.text(1, -2.1, 'Example Graph', style='italic')
>>> ax.annotate("Sine",
                xy=(8, 0), xycoords='data',
                xytext=(10.5, 0), textcoords='data',
                arrowprops=dict(arrowstyle="->", connectionstyle="arc3"))
Mathtext
>>> plt.title(r'$sigma_i=15$', fontsize=20)
Limits, Legends & Layouts
Limits & Autoscaling
>>> ax.margins(x=0.0, y=0.1)                          Add padding to a plot
>>> ax.axis('equal')                                  Set the aspect ratio of the plot to 1
>>> ax.set(xlim=[0,10.5], ylim=[-1.5,1.5])            Set limits for x- and y-axis
>>> ax.set_xlim(0, 10.5)                              Set limits for x-axis
Legends
>>> ax.set(title='An Example Axes',                   Set a title and x- and y-axis labels
           ylabel='Y-Axis',
           xlabel='X-Axis')
>>> ax.legend(loc='best')                             No overlapping plot elements
Ticks
>>> ax.xaxis.set(ticks=range(1,5),                    Manually set x-ticks
                 ticklabels=[3,100,-12,"foo"])
>>> ax.tick_params(axis='y',                          Make y-ticks longer and go in and out
                   direction='inout',
                   length=10)
Subplot Spacing
>>> fig3.subplots_adjust(wspace=0.5,                  Adjust the spacing between subplots
                         hspace=0.3,
                         left=0.125, right=0.9,
                         top=0.9, bottom=0.1)
>>> fig.tight_layout()                                Fit subplot(s) into the figure area
Axis Spines
>>> ax1.spines['top'].set_visible(False)              Make the top axis line for a plot invisible
>>> ax1.spines['bottom'].set_position(('outward',10)) Move the bottom axis line outward
3 Plotting Routines
1D Data
>>> fig, ax = plt.subplots()
>>> lines = ax.plot(x, y)                             Draw points with lines or markers connecting them
>>> ax.scatter(x, y)                                  Draw unconnected points, scaled or colored
>>> axes[0,0].bar([1,2,3], [3,4,5])                   Plot vertical rectangles (constant width)
>>> axes[1,0].barh([0.5,1,2.5], [0,1,2])              Plot horizontal rectangles (constant height)
>>> axes[1,1].axhline(0.45)                           Draw a horizontal line across axes
>>> axes[0,1].axvline(0.65)                           Draw a vertical line across axes
>>> ax.fill(x, y, color='blue')                       Draw filled polygons
>>> ax.fill_between(x, y, color='yellow')             Fill between y-values and 0
2D Data or Images
>>> fig, ax = plt.subplots()
>>> im = ax.imshow(img,                               Colormapped or RGB arrays
                   cmap='gist_earth',
                   interpolation='nearest',
                   vmin=-2, vmax=2)
>>> axes2[0].pcolor(data2)                            Pseudocolor plot of 2D array
>>> axes2[0].pcolormesh(data)                         Pseudocolor plot of 2D array
>>> CS = plt.contour(Y, X, U)                         Plot contours
>>> axes2[2].contourf(data1)                          Plot filled contours
>>> axes2[2] = ax.clabel(CS)                          Label a contour plot
Vector Fields
>>> axes[0,1].arrow(0,0,0.5,0.5)                      Add an arrow to the axes
>>> axes[1,1].quiver(y, z)                            Plot a 2D field of arrows
>>> axes[0,1].streamplot(X, Y, U, V)                  Plot a 2D field of arrows
Data Distributions
>>> ax1.hist(y)                                       Plot a histogram
>>> ax3.boxplot(y)                                    Make a box-and-whisker plot
>>> ax3.violinplot(z)                                 Make a violin plot

5 Save Plot
Save figures
>>> plt.savefig('foo.png')
Save transparent figures
>>> plt.savefig('foo.png', transparent=True)

6 Show Plot
>>> plt.show()

Close & Clear
>>> plt.cla()                                         Clear an axis
>>> plt.clf()                                         Clear the entire figure
>>> plt.close()                                       Close a window

Matplotlib 2.0.0 - Updated: 02/2017
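As a quick recap of the six-step workflow above, the following sketch prepares data, creates a figure with two subplots, plots, customizes, saves and shows; the file name is arbitrary:

import numpy as np
import matplotlib.pyplot as plt

# Prepare data (step 1)
x = np.linspace(0, 10, 100)
y = np.cos(x)

# Create a figure with two subplots (step 2)
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(6, 4))

# Plot (step 3): a line on the first axes, a histogram on the second
ax1.plot(x, y, c='k', linewidth=2)
ax2.hist(y, bins=20)

# Customize (step 4): title, limits, layout
ax1.set(title='cos(x)', xlim=(0, 10), ylim=(-1.5, 1.5))
fig.tight_layout()

# Save and show (steps 5-6)
plt.savefig('cosine_demo.png')   # file name chosen for illustration
plt.show()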
Python For Data Science Cheat Sheet: Seaborn
Learn Data Science Interactively at www.DataCamp.com

Statistical Data Visualization With Seaborn
The Python visualization library Seaborn is based on matplotlib and provides a
high-level interface for drawing attractive statistical graphics.

Make use of the following aliases to import the libraries:
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns

The basic steps to creating plots with Seaborn are:
1. Prepare some data
2. Control figure aesthetics
3. Plot with Seaborn
4. Further customize your plot
5. Show or save your plot

>>> tips = sns.load_dataset("tips")                        Step 1
>>> sns.set_style("whitegrid")                             Step 2
>>> g = sns.lmplot(x="tip",                                Step 3
                   y="total_bill",
                   data=tips,
                   aspect=2)
>>> g = (g.set_axis_labels("Tip","Total bill(USD)").
           set(xlim=(0,10), ylim=(0,100)))
>>> plt.title("title")                                     Step 4
>>> plt.show(g)                                            Step 5

1 Data                                                     Also see Lists, NumPy & Pandas
>>> import pandas as pd
>>> import numpy as np
>>> uniform_data = np.random.rand(10, 12)
>>> data = pd.DataFrame({'x': np.arange(1,101),
                         'y': np.random.normal(0,4,100)})
Seaborn also offers built-in data sets:
>>> titanic = sns.load_dataset("titanic")
>>> iris = sns.load_dataset("iris")

2 Figure Aesthetics                                        Also see Matplotlib
>>> f, ax = plt.subplots(figsize=(5,6))                    Create a figure and one subplot
Seaborn styles
>>> sns.set()                                              (Re)set the seaborn defaults
>>> sns.set_style("whitegrid")                             Set the matplotlib parameters
>>> sns.set_style("ticks",                                 Set the matplotlib parameters
                  {"xtick.major.size":8,
                   "ytick.major.size":8})
>>> sns.axes_style("whitegrid")                            Return a dict of params, or use with "with" to temporarily set the style
Context Functions
>>> sns.set_context("talk")                                Set context to "talk"
>>> sns.set_context("notebook",                            Set context to "notebook", scale font elements and override param mapping
                    font_scale=1.5,
                    rc={"lines.linewidth":2.5})
Color Palette
>>> sns.set_palette("husl",3)                              Define the color palette
>>> sns.color_palette("husl")                              Use with "with" to temporarily set the palette
>>> flatui = ["#9b59b6","#3498db","#95a5a6","#e74c3c","#34495e","#2ecc71"]
>>> sns.set_palette(flatui)                                Set your own color palette

3 Plotting With Seaborn
Axis Grids
>>> g = sns.FacetGrid(titanic,                             Subplot grid for plotting conditional relationships
                      col="survived",
                      row="sex")
>>> g = g.map(plt.hist, "age")
>>> sns.factorplot(x="pclass",                             Draw a categorical plot onto a FacetGrid
                   y="survived",
                   hue="sex",
                   data=titanic)
>>> sns.lmplot(x="sepal_width",                            Plot data and regression model fits across a FacetGrid
               y="sepal_length",
               hue="species",
               data=iris)
>>> h = sns.PairGrid(iris)                                 Subplot grid for plotting pairwise relationships
>>> h = h.map(plt.scatter)
>>> sns.pairplot(iris)                                     Plot pairwise bivariate distributions
>>> i = sns.JointGrid(x="x",                               Grid for bivariate plot with marginal univariate plots
                      y="y",
                      data=data)
>>> i = i.plot(sns.regplot, sns.distplot)
>>> sns.jointplot("sepal_length",                          Plot bivariate distribution
                  "sepal_width",
                  data=iris,
                  kind='kde')
Categorical Plots
Scatterplot
>>> sns.stripplot(x="species",                             Scatterplot with one categorical variable
                  y="petal_length",
                  data=iris)
>>> sns.swarmplot(x="species",                             Categorical scatterplot with non-overlapping points
                  y="petal_length",
                  data=iris)
Bar Chart
>>> sns.barplot(x="sex",                                   Show point estimates and confidence intervals as rectangular bars
                y="survived",
                hue="class",
                data=titanic)
Count Plot
>>> sns.countplot(x="deck",                                Show counts of observations
                  data=titanic,
                  palette="Greens_d")
Point Plot
>>> sns.pointplot(x="class",                               Show point estimates and confidence intervals with scatterplot glyphs
                  y="survived",
                  hue="sex",
                  data=titanic,
                  palette={"male":"g", "female":"m"},
                  markers=["^","o"],
                  linestyles=["-","--"])
Boxplot
>>> sns.boxplot(x="alive",                                 Boxplot
                y="age",
                hue="adult_male",
                data=titanic)
>>> sns.boxplot(data=iris, orient="h")                     Boxplot with wide-form data
Violinplot
>>> sns.violinplot(x="age",                                Violin plot
                   y="sex",
                   hue="survived",
                   data=titanic)
Regression Plots
>>> sns.regplot(x="sepal_width",                           Plot data and a linear regression model fit
                y="sepal_length",
                data=iris,
                ax=ax)
Distribution Plots
>>> plot = sns.distplot(data.y,                            Plot univariate distribution
                        kde=False,
                        color="b")
Matrix Plots
>>> sns.heatmap(uniform_data, vmin=0, vmax=1)              Heatmap

4 Further Customizations                                   Also see Matplotlib
Axisgrid Objects
>>> g.despine(left=True)                                   Remove left spine
>>> g.set_ylabels("Survived")                              Set the labels of the y-axis
>>> g.set_xticklabels(rotation=45)                         Set the tick labels for x
>>> g.set_axis_labels("Survived", "Sex")                   Set the axis labels
>>> h.set(xlim=(0,5),                                      Set the limits and ticks of the x- and y-axis
          ylim=(0,5),
          xticks=[0,2.5,5],
          yticks=[0,2.5,5])
Plot
>>> plt.title("A Title")                                   Add plot title
>>> plt.ylabel("Survived")                                 Adjust the label of the y-axis
>>> plt.xlabel("Sex")                                      Adjust the label of the x-axis
>>> plt.ylim(0,100)                                        Adjust the limits of the y-axis
>>> plt.xlim(0,10)                                         Adjust the limits of the x-axis
>>> plt.setp(ax, yticks=[0,5])                             Adjust a plot property
>>> plt.tight_layout()                                     Adjust subplot params

5 Show or Save Plot                                        Also see Matplotlib
>>> plt.show()                                             Show the plot
>>> plt.savefig("foo.png")                                 Save the plot as a figure
>>> plt.savefig("foo.png", transparent=True)               Save transparent figure

Close & Clear                                              Also see Matplotlib
>>> plt.cla()                                              Clear an axis
>>> plt.clf()                                              Clear an entire figure
>>> plt.close()                                            Close a window
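A compact end-to-end sketch of the five Seaborn steps, using the built-in tips dataset loaded above; the plot choices are illustrative only:

import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: data (built-in demo dataset; downloaded on first use)
tips = sns.load_dataset("tips")

# Step 2: figure aesthetics
sns.set_style("whitegrid")

# Step 3: plot with Seaborn
ax = sns.boxplot(x="day", y="total_bill", hue="smoker", data=tips)

# Step 4: further customization via matplotlib
ax.set(title="Total bill by day", ylabel="Total bill (USD)")
plt.tight_layout()

# Step 5: show or save
plt.show()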
Python For Data Science Cheat Sheet: Bokeh
Learn Bokeh Interactively at www.DataCamp.com, taught by Bryan Van de Ven, core contributor

Plotting With Bokeh
The Python interactive visualization library Bokeh enables high-performance
visual presentation of large datasets in modern web browsers.
Bokeh's mid-level general purpose bokeh.plotting interface is centered around
two main components: data and glyphs (data + glyphs = plot).

The basic steps to creating plots with the bokeh.plotting interface are:
1. Prepare some data: Python lists, NumPy arrays, Pandas DataFrames and other sequences of values
2. Create a new plot
3. Add renderers for your data, with visual customizations
4. Specify where to generate the output
5. Show or save the results

>>> from bokeh.plotting import figure
>>> from bokeh.io import output_file, show
>>> x = [1, 2, 3, 4, 5]                                Step 1
>>> y = [6, 7, 2, 4, 5]
>>> p = figure(title="simple line example",            Step 2
               x_axis_label='x',
               y_axis_label='y')
>>> p.line(x, y, legend="Temp.", line_width=2)         Step 3
>>> output_file("lines.html")                          Step 4
>>> show(p)                                            Step 5

1 Data                                                 Also see Lists, NumPy & Pandas
Under the hood, your data is converted to ColumnDataSources. You can also do this manually:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.array([[33.9,4,65, 'US'],
                                [32.4,4,66, 'Asia'],
                                [21.4,4,109, 'Europe']]),
                      columns=['mpg','cyl','hp','origin'],
                      index=['Toyota', 'Fiat', 'Volvo'])
>>> from bokeh.models import ColumnDataSource
>>> cds_df = ColumnDataSource(df)

2 Plotting
>>> from bokeh.plotting import figure
>>> p1 = figure(plot_width=300, tools='pan,box_zoom')
>>> p2 = figure(plot_width=300, plot_height=300,
                x_range=(0, 8), y_range=(0, 8))
>>> p3 = figure()

3 Renderers & Visual Customizations
Glyphs
Scatter Markers
>>> p1.circle(np.array([1,2,3]), np.array([3,2,1]),
              fill_color='white')
>>> p2.square(np.array([1.5,3.5,5.5]), [1,4,3],
              color='blue', size=1)
Line Glyphs
>>> p1.line([1,2,3,4], [3,4,5,6], line_width=2)
>>> p2.multi_line(pd.DataFrame([[1,2,3],[5,6,7]]),
                  pd.DataFrame([[3,4,5],[3,2,1]]),
                  color="blue")
Customized Glyphs                                      Also see Data
Selection and Non-Selection Glyphs
>>> p = figure(tools='box_select')
>>> p.circle('mpg', 'cyl', source=cds_df,
             selection_color='red',
             nonselection_alpha=0.1)
Hover Glyphs
>>> from bokeh.models import HoverTool
>>> hover = HoverTool(tooltips=None, mode='vline')
>>> p3.add_tools(hover)
Colormapping
>>> from bokeh.models import CategoricalColorMapper
>>> color_mapper = CategoricalColorMapper(
        factors=['US', 'Asia', 'Europe'],
        palette=['blue', 'red', 'green'])
>>> p3.circle('mpg', 'cyl', source=cds_df,
              color=dict(field='origin', transform=color_mapper),
              legend='Origin')
Legend Location
Inside Plot Area
>>> p.legend.location = 'bottom_left'
Outside Plot Area
>>> from bokeh.models import Legend
>>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1]))
>>> r2 = p2.line([1,2,3,4], [3,4,5,6])
>>> legend = Legend(items=[("One", [p1, r1]), ("Two", [r2])],
                    location=(0, -30))
>>> p.add_layout(legend, 'right')
Legend Orientation
>>> p.legend.orientation = "horizontal"
>>> p.legend.orientation = "vertical"
Legend Background & Border
>>> p.legend.border_line_color = "navy"
>>> p.legend.background_fill_color = "white"
Rows & Columns Layout
Rows
>>> from bokeh.layouts import row
>>> layout = row(p1,p2,p3)
Columns
>>> from bokeh.layouts import column
>>> layout = column(p1,p2,p3)
Nesting Rows & Columns
>>> layout = row(column(p1,p2), p3)
Grid Layout
>>> from bokeh.layouts import gridplot
>>> row1 = [p1,p2]
>>> row2 = [p3]
>>> layout = gridplot([[p1,p2],[p3]])
Tabbed Layout
>>> from bokeh.models.widgets import Panel, Tabs
>>> tab1 = Panel(child=p1, title="tab1")
>>> tab2 = Panel(child=p2, title="tab2")
>>> layout = Tabs(tabs=[tab1, tab2])
Linked Plots
Linked Axes
>>> p2.x_range = p1.x_range
>>> p2.y_range = p1.y_range
Linked Brushing
>>> p4 = figure(plot_width=100, tools='box_select,lasso_select')
>>> p4.circle('mpg', 'cyl', source=cds_df)
>>> p5 = figure(plot_width=200, tools='box_select,lasso_select')
>>> p5.circle('mpg', 'hp', source=cds_df)
>>> layout = row(p4,p5)

4 Output & Export
Notebook
>>> from bokeh.io import output_notebook, show
>>> output_notebook()
HTML
Standalone HTML
>>> from bokeh.embed import file_html
>>> from bokeh.resources import CDN
>>> html = file_html(p, CDN, "my_plot")
>>> from bokeh.io import output_file, show
>>> output_file('my_bar_chart.html', mode='cdn')
Components
>>> from bokeh.embed import components
>>> script, div = components(p)
PNG
>>> from bokeh.io import export_png
>>> export_png(p, filename="plot.png")
SVG
>>> from bokeh.io import export_svgs
>>> p.output_backend = "svg"
>>> export_svgs(p, filename="plot.svg")

5 Show or Save Your Plots
>>> show(p1)        >>> show(layout)
>>> save(p1)        >>> save(layout)
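A minimal end-to-end sketch of the five bokeh.plotting steps; it uses the legend keyword as in this sheet's (older) Bokeh API, whereas newer Bokeh versions spell it legend_label:

import numpy as np
from bokeh.plotting import figure
from bokeh.io import output_file, show

# Step 1: prepare data
x = np.linspace(0, 10, 50)
y = np.sin(x)

# Step 2: create a new plot
p = figure(title="sine wave", x_axis_label='x', y_axis_label='y')

# Step 3: add renderers with visual customizations
p.line(x, y, legend="sin(x)", line_width=2)
p.circle(x, y, fill_color='white', size=5)

# Steps 4-5: specify the output file and show the result
output_file("sine.html")
show(p)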
Python For Data Science Cheat Sheet: SciPy - Linear Algebra
Learn More Python for Data Science Interactively at www.datacamp.com

SciPy
The SciPy library is one of the core packages for scientific computing that
provides mathematical algorithms and convenience functions built on the NumPy
extension of Python.

Interacting With NumPy                                 Also see NumPy
>>> import numpy as np
>>> a = np.array([1,2,3])
>>> b = np.array([(1+5j,2j,3j), (4j,5j,6j)])
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]])
Index Tricks
>>> np.mgrid[0:5,0:5]                    Create a dense meshgrid
>>> np.ogrid[0:2,0:2]                    Create an open meshgrid
>>> np.r_[3,[0]*5,-1:1:10j]              Stack arrays vertically (row-wise)
>>> np.c_[b,c]                           Create stacked column-wise arrays
Shape Manipulation
>>> np.transpose(b)                      Permute array dimensions
>>> b.flatten()                          Flatten the array
>>> np.hstack((b,c))                     Stack arrays horizontally (column-wise)
>>> np.vstack((a,b))                     Stack arrays vertically (row-wise)
>>> np.hsplit(c,2)                       Split the array horizontally at the 2nd index
>>> np.vsplit(d,2)                       Split the array vertically at the 2nd index
Polynomials
>>> from numpy import poly1d
>>> p = poly1d([3,4,5])                  Create a polynomial object
Vectorizing Functions
>>> def myfunc(a):
...     if a < 0:
...         return a*2
...     else:
...         return a/2
>>> np.vectorize(myfunc)                 Vectorize functions
Type Handling
>>> np.real(c)                           Return the real part of the array elements
>>> np.imag(c)                           Return the imaginary part of the array elements
>>> np.real_if_close(c,tol=1000)         Return a real array if complex parts are close to 0
>>> np.cast['f'](np.pi)                  Cast object to a data type
Other Useful Functions
>>> np.angle(b,deg=True)                 Return the angle of the complex argument
>>> g = np.linspace(0,np.pi,num=5)       Create an array of evenly spaced values (number of samples)
>>> g[3:] += np.pi
>>> np.unwrap(g)                         Unwrap
>>> np.logspace(0,10,3)                  Create an array of evenly spaced values (log scale)
>>> np.select([c<4],[c*2])               Return values from a list of arrays depending on conditions
>>> misc.factorial(a)                    Factorial
>>> misc.comb(10,3,exact=True)           Combine N things taken at k time
>>> misc.central_diff_weights(3)         Weights for Np-point central derivative
>>> misc.derivative(myfunc,1.0)          Find the n-th derivative of a function at a point

Linear Algebra                                         Also see NumPy
You'll use the linalg and sparse modules. Note that scipy.linalg contains and
expands on numpy.linalg.
>>> from scipy import linalg, sparse
Creating Matrices
>>> A = np.matrix(np.random.random((2,2)))
>>> B = np.asmatrix(b)
>>> C = np.mat(np.random.random((10,5)))
>>> D = np.mat([[3,4], [5,6]])
Basic Matrix Routines
Inverse
>>> A.I                                  Inverse
>>> linalg.inv(A)                        Inverse
>>> A.T                                  Transpose matrix
>>> A.H                                  Conjugate transposition
>>> np.trace(A)                          Trace
Norm
>>> linalg.norm(A)                       Frobenius norm
>>> linalg.norm(A,1)                     L1 norm (max column sum)
>>> linalg.norm(A,np.inf)                L inf norm (max row sum)
Rank
>>> np.linalg.matrix_rank(C)             Matrix rank
Determinant
>>> linalg.det(A)                        Determinant
Solving linear problems
>>> linalg.solve(A,b)                    Solver for dense matrices
>>> E = np.mat(a).T                      Solver for dense matrices
>>> linalg.lstsq(D,E)                    Least-squares solution to a linear matrix equation
Generalized inverse
>>> linalg.pinv(C)                       Compute the pseudo-inverse of a matrix (least-squares solver)
>>> linalg.pinv2(C)                      Compute the pseudo-inverse of a matrix (SVD)
Creating Sparse Matrices
>>> F = np.eye(3, k=1)                   3x3 array with ones on the first superdiagonal
>>> G = np.mat(np.identity(2))           Create a 2x2 identity matrix
>>> C[C > 0.5] = 0
>>> H = sparse.csr_matrix(C)             Compressed Sparse Row matrix
>>> I = sparse.csc_matrix(D)             Compressed Sparse Column matrix
>>> J = sparse.dok_matrix(A)             Dictionary Of Keys matrix
>>> E.todense()                          Sparse matrix to full matrix
>>> sparse.isspmatrix_csc(A)             Identify sparse matrix
Sparse Matrix Routines
>>> sparse.linalg.inv(I)                 Inverse
>>> sparse.linalg.norm(I)                Norm
>>> sparse.linalg.spsolve(H,I)           Solver for sparse matrices
Sparse Matrix Functions
>>> sparse.linalg.expm(I)                Sparse matrix exponential
Sparse Matrix Decompositions
>>> la, v = sparse.linalg.eigs(F,1)      Eigenvalues and eigenvectors
>>> sparse.linalg.svds(H, 2)             SVD

Matrix Functions
>>> np.add(A,D)                          Addition
>>> np.subtract(A,D)                     Subtraction
>>> np.divide(A,D)                       Division
>>> np.multiply(D,A)                     Multiplication
>>> np.dot(A,D)                          Dot product
>>> np.vdot(A,D)                         Vector dot product
>>> np.inner(A,D)                        Inner product
>>> np.outer(A,D)                        Outer product
>>> np.tensordot(A,D)                    Tensor dot product
>>> np.kron(A,D)                         Kronecker product
Exponential Functions
>>> linalg.expm(A)                       Matrix exponential
>>> linalg.expm2(A)                      Matrix exponential (Taylor series)
>>> linalg.expm3(D)                      Matrix exponential (eigenvalue decomposition)
Logarithm Function
>>> linalg.logm(A)                       Matrix logarithm
Trigonometric Functions
>>> linalg.sinm(D)                       Matrix sine
>>> linalg.cosm(D)                       Matrix cosine
>>> linalg.tanm(A)                       Matrix tangent
Hyperbolic Trigonometric Functions
>>> linalg.sinhm(D)                      Hyperbolic matrix sine
>>> linalg.coshm(D)                      Hyperbolic matrix cosine
>>> linalg.tanhm(A)                      Hyperbolic matrix tangent
Matrix Sign Function
>>> linalg.signm(A)                      Matrix sign function
Matrix Square Root
>>> linalg.sqrtm(A)                      Matrix square root
Arbitrary Functions
>>> linalg.funm(A, lambda x: x*x)        Evaluate a matrix function

Decompositions
Eigenvalues and Eigenvectors
>>> la, v = linalg.eig(A)                Solve an ordinary or generalized eigenvalue problem for a square matrix
>>> l1, l2 = la                          Unpack eigenvalues
>>> v[:,0]                               First eigenvector
>>> v[:,1]                               Second eigenvector
>>> linalg.eigvals(A)                    Compute eigenvalues only
Singular Value Decomposition
>>> U,s,Vh = linalg.svd(B)               Singular Value Decomposition (SVD)
>>> M,N = B.shape
>>> Sig = linalg.diagsvd(s,M,N)          Construct the sigma matrix in SVD
LU Decomposition
>>> P,L,U = linalg.lu(C)                 LU Decomposition

Asking For Help
>>> help(scipy.linalg.diagsvd)
>>> np.info(np.matrix)
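A short runnable sketch of the dense-solver and decomposition calls above on a small 2x2 system; the matrix values are arbitrary:

import numpy as np
from scipy import linalg

# A small dense system A x = b, solved two ways
A = np.array([[3.0, 2.0], [1.0, 4.0]])
b = np.array([5.0, 6.0])

x = linalg.solve(A, b)                     # direct solver for square, dense A
x_ls, res, rank, sv = linalg.lstsq(A, b)   # least-squares route gives the same x here

# Eigendecomposition and a matrix function for comparison
eigvals, eigvecs = linalg.eig(A)
expA = linalg.expm(A)                      # matrix exponential

print(x, x_ls, eigvals, sep='\n')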
Python For Data Science Cheat Sheet: PySpark - SQL Basics
Learn Python for data science Interactively at www.DataCamp.com

PySpark & Spark SQL
Spark SQL is Apache Spark's module for working with structured data.

Initializing SparkSession
A SparkSession can be used to create DataFrames, register DataFrames as tables,
execute SQL over tables, cache tables, and read parquet files.
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
      .builder \
      .appName("Python Spark SQL basic example") \
      .config("spark.some.config.option", "some-value") \
      .getOrCreate()

Creating DataFrames
From RDDs
>>> from pyspark.sql.types import *
Infer Schema
>>> sc = spark.sparkContext
>>> lines = sc.textFile("people.txt")
>>> parts = lines.map(lambda l: l.split(","))
>>> people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
>>> peopledf = spark.createDataFrame(people)
Specify Schema
>>> people = parts.map(lambda p: Row(name=p[0],
                                     age=int(p[1].strip())))
>>> schemaString = "name age"
>>> fields = [StructField(field_name, StringType(), True)
              for field_name in schemaString.split()]
>>> schema = StructType(fields)
>>> spark.createDataFrame(people, schema).show()
+--------+---+
|    name|age|
+--------+---+
|    Mine| 28|
|   Filip| 29|
|Jonathan| 30|
+--------+---+
From Spark Data Sources
JSON
>>> df = spark.read.json("customer.json")
>>> df.show()
+--------------------+---+---------+--------+--------------------+
|             address|age|firstName|lastName|         phoneNumber|
+--------------------+---+---------+--------+--------------------+
|[New York,10021,N...| 25|     John|   Smith|[[212 555-1234,ho...|
|[New York,10021,N...| 21|     Jane|     Doe|[[322 888-1234,ho...|
+--------------------+---+---------+--------+--------------------+
>>> df2 = spark.read.load("people.json", format="json")
Parquet files
>>> df3 = spark.read.load("users.parquet")
TXT files
>>> df4 = spark.read.text("people.txt")

Inspect Data
>>> df.dtypes                            Return df column names and data types
>>> df.show()                            Display the content of df
>>> df.head()                            Return first n rows
>>> df.first()                           Return first row
>>> df.take(2)                           Return the first n rows
>>> df.schema                            Return the schema of df
>>> df.describe().show()                 Compute summary statistics
>>> df.columns                           Return the columns of df
>>> df.count()                           Count the number of rows in df
>>> df.distinct().count()                Count the number of distinct rows in df
>>> df.printSchema()                     Print the schema of df
>>> df.explain()                         Print the (logical and physical) plans

Duplicate Values
>>> df = df.dropDuplicates()

Queries
>>> from pyspark.sql import functions as F
Select
>>> df.select("firstName").show()        Show all entries in the firstName column
>>> df.select("firstName","lastName") \
      .show()
>>> df.select("firstName",               Show all entries in firstName, age and type
              "age",
              explode("phoneNumber") \
              .alias("contactInfo")) \
      .select("contactInfo.type",
              "firstName",
              "age") \
      .show()
>>> df.select(df["firstName"],df["age"]+ 1) \    Show all entries in firstName and age, add 1 to the entries of age
      .show()
>>> df.select(df['age'] > 24).show()     Show all entries where age > 24
When
>>> df.select("firstName",               Show firstName and 0 or 1 depending on age > 30
              F.when(df.age > 30, 1) \
              .otherwise(0)) \
      .show()
>>> df[df.firstName.isin("Jane","Boris")] \      Show firstName if in the given options
      .collect()
Like
>>> df.select("firstName",               Show firstName, and lastName is TRUE if lastName is like Smith
              df.lastName.like("Smith")) \
      .show()
Startswith - Endswith
>>> df.select("firstName",               Show firstName, and TRUE if lastName starts with Sm
              df.lastName \
              .startswith("Sm")) \
      .show()
>>> df.select(df.lastName.endswith("th")) \      Show last names ending in th
      .show()
Substring
>>> df.select(df.firstName.substr(1, 3) \        Return substrings of firstName
              .alias("name")) \
      .collect()
Between
>>> df.select(df.age.between(22, 24)) \          Show age: values are TRUE if between 22 and 24
      .show()

Add, Update & Remove Columns
Adding Columns
>>> df = df.withColumn('city',df.address.city) \
           .withColumn('postalCode',df.address.postalCode) \
           .withColumn('state',df.address.state) \
           .withColumn('streetAddress',df.address.streetAddress) \
           .withColumn('telePhoneNumber', explode(df.phoneNumber.number)) \
           .withColumn('telePhoneType', explode(df.phoneNumber.type))
Updating Columns
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber')
Removing Columns
>>> df = df.drop("address", "phoneNumber")
>>> df = df.drop(df.address).drop(df.phoneNumber)

GroupBy
>>> df.groupBy("age") \                  Group by age, count the members in the groups
      .count() \
      .show()

Filter
>>> df.filter(df["age"]>24).show()       Filter entries of age, only keep those records whose values are > 24

Sort
>>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
>>> df.orderBy(["age","city"],ascending=[0,1]) \
      .collect()

Missing & Replacing Values
>>> df.na.fill(50).show()                Replace null values
>>> df.na.drop().show()                  Return new df omitting rows with null values
>>> df.na \                              Return new df replacing one value with another
      .replace(10, 20) \
      .show()

Repartitioning
>>> df.repartition(10) \                 df with 10 partitions
      .rdd \
      .getNumPartitions()
>>> df.coalesce(1).rdd.getNumPartitions()   df with 1 partition

Running SQL Queries Programmatically
Registering DataFrames as Views
>>> peopledf.createGlobalTempView("people")
>>> df.createTempView("customer")
>>> df.createOrReplaceTempView("customer")
Query Views
>>> df5 = spark.sql("SELECT * FROM customer").show()
>>> peopledf2 = spark.sql("SELECT * FROM global_temp.people") \
                     .show()

Output
Data Structures
>>> rdd1 = df.rdd                        Convert df into an RDD
>>> df.toJSON().first()                  Convert df into an RDD of string
>>> df.toPandas()                        Return the contents of df as a Pandas DataFrame
Write & Save to Files
>>> df.select("firstName", "city") \
      .write \
      .save("nameAndCity.parquet")
>>> df.select("firstName", "age") \
      .write \
      .save("namesAndAges.json", format="json")

Stopping SparkSession
>>> spark.stop()
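A minimal sketch combining several of the DataFrame operations above; it assumes a local Spark installation and builds a tiny inline DataFrame (names and values invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumed local session; adjust the app name / master for your cluster
spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# A tiny DataFrame created inline for illustration
df = spark.createDataFrame(
    [("Jane", 25, "NY"), ("Boris", 31, "SF"), ("Mia", 19, "NY")],
    ["firstName", "age", "city"])

# Filter, derive a column, then group and aggregate
(df.filter(F.col("age") > 20)
   .withColumn("ageNextYear", F.col("age") + 1)
   .groupBy("city")
   .agg(F.count("*").alias("n"), F.avg("age").alias("meanAge"))
   .show())

spark.stop()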
Python For Data Science Cheat Sheet: PySpark - RDD Basics
Learn Python for data science Interactively at www.DataCamp.com

Spark
PySpark is the Spark Python API that exposes the Spark programming model to Python.

Initializing Spark
SparkContext
>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')
Inspect SparkContext
>>> sc.version                           Retrieve SparkContext version
>>> sc.pythonVer                         Retrieve Python version
>>> sc.master                            Master URL to connect to
>>> str(sc.sparkHome)                    Path where Spark is installed on worker nodes
>>> str(sc.sparkUser())                  Retrieve name of the Spark user running SparkContext
>>> sc.appName                           Return application name
>>> sc.applicationId                     Retrieve application ID
>>> sc.defaultParallelism                Return default level of parallelism
>>> sc.defaultMinPartitions              Default minimum number of partitions for RDDs
Configuration
>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
            .setMaster("local")
            .setAppName("My app")
            .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)
Using The Shell
In the PySpark shell, a special interpreter-aware SparkContext is already
created in the variable called sc.
$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py
Set which master the context connects to with the --master argument, and add
Python .zip, .egg or .py files to the runtime path by passing a comma-separated
list to --py-files.

Loading Data
Parallelized Collections
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p", "r"])])
External Data
Read either one text file from HDFS, a local file system or any
Hadoop-supported file system URI with textFile(), or read in a directory of
text files with wholeTextFiles().
>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

Retrieving RDD Information
Basic Information
>>> rdd.getNumPartitions()               List the number of partitions
>>> rdd.count()                          Count RDD instances → 3
>>> rdd.countByKey()                     Count RDD instances by key → defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue()                   Count RDD instances by value → defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap()                   Return (key,value) pairs as a dictionary → {'a': 2,'b': 2}
>>> rdd3.sum()                           Sum of RDD elements → 4950
>>> sc.parallelize([]).isEmpty()         Check whether RDD is empty → True
Summary
>>> rdd3.max()                           Maximum value of RDD elements → 99
>>> rdd3.min()                           Minimum value of RDD elements → 0
>>> rdd3.mean()                          Mean value of RDD elements → 49.5
>>> rdd3.stdev()                         Standard deviation of RDD elements → 28.866070047722118
>>> rdd3.variance()                      Compute variance of RDD elements → 833.25
>>> rdd3.histogram(3)                    Compute histogram by bins → ([0,33,66,99],[33,33,34])
>>> rdd3.stats()                         Summary statistics (count, mean, stdev, max & min)

Applying Functions
>>> rdd.map(lambda x: x+(x[1],x[0])) \   Apply a function to each RDD element
       .collect()
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))   Apply a function to each RDD element and flatten the result
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
>>> rdd4.flatMapValues(lambda x: x) \    Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
        .collect()
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]

Selecting Data
Getting
>>> rdd.collect()                        Return a list with all RDD elements → [('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2)                          Take first 2 RDD elements → [('a', 7), ('a', 2)]
>>> rdd.first()                          Take first RDD element → ('a', 7)
>>> rdd.top(2)                           Take top 2 RDD elements → [('b', 2), ('a', 7)]
Sampling
>>> rdd3.sample(False, 0.15, 81).collect()   Return a sampled subset of rdd3 → [3,4,27,31,40,41,42,43,60,76,79,80,86,97]
Filtering
>>> rdd.filter(lambda x: "a" in x) \     Filter the RDD → [('a',7),('a',2)]
       .collect()
>>> rdd5.distinct().collect()            Return distinct RDD values → ['a',2,'b',7]
>>> rdd.keys().collect()                 Return (key,value) RDD's keys → ['a', 'a', 'b']
Iterating
>>> def g(x): print(x)
>>> rdd.foreach(g)                       Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

Reshaping Data
Reducing
>>> rdd.reduceByKey(lambda x,y : x+y) \  Merge the rdd values for each key → [('a',9),('b',2)]
       .collect()
>>> rdd.reduce(lambda a, b: a + b)       Merge the rdd values → ('a',7,'a',2,'b',2)
Grouping by
>>> rdd3.groupBy(lambda x: x % 2) \      Return RDD of grouped values
        .mapValues(list) \
        .collect()
>>> rdd.groupByKey() \                   Group rdd by key → [('a',[7,2]),('b',[2])]
       .mapValues(list) \
       .collect()
Aggregating
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
>>> rdd3.aggregate((0,0),seqOp,combOp)   Aggregate RDD elements of each partition and then the results → (4950,100)
>>> rdd.aggregateByKey((0,0),seqop,combop) \   Aggregate values of each RDD key → [('a',(9,2)), ('b',(2,1))]
       .collect()
>>> rdd3.fold(0,add)                     Aggregate the elements of each partition, and then the results → 4950
>>> rdd.foldByKey(0, add) \              Merge the values for each key → [('a',9),('b',2)]
       .collect()
>>> rdd3.keyBy(lambda x: x+x) \          Create tuples of RDD elements by applying a function
        .collect()

Mathematical Operations
>>> rdd.subtract(rdd2) \                 Return each rdd value not contained in rdd2 → [('b',2),('a',7)]
       .collect()
>>> rdd2.subtractByKey(rdd) \            Return each (key,value) pair of rdd2 with no matching key in rdd → [('d', 1)]
        .collect()
>>> rdd.cartesian(rdd2).collect()        Return the Cartesian product of rdd and rdd2
Sort
>>> rdd2.sortBy(lambda x: x[1]) \        Sort RDD by given function → [('d',1),('b',1),('a',2)]
        .collect()
>>> rdd2.sortByKey() \                   Sort (key, value) RDD by key → [('a',2),('b',1),('d',1)]
        .collect()

Repartitioning
>>> rdd.repartition(4)                   New RDD with 4 partitions
>>> rdd.coalesce(1)                      Decrease the number of partitions in the RDD to 1

Saving
>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')

Stopping SparkContext
>>> sc.stop()

Execution
$ ./bin/spark-submit examples/src/main/python/pi.py
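A small sketch of the RDD workflow above (parallelize, transform, reduce), assuming a local SparkContext; the key/value pairs mirror the rdd used throughout this sheet:

from pyspark import SparkContext

# Assumed local context with two worker threads
sc = SparkContext(master='local[2]', appName='rdd-sketch')

rdd = sc.parallelize([('a', 7), ('a', 2), ('b', 2)])

# Transformations are lazy; collect()/reduce() trigger execution
summed = rdd.reduceByKey(lambda x, y: x + y).collect()        # [('a', 9), ('b', 2)]
grouped = rdd.groupByKey().mapValues(list).collect()          # [('a', [7, 2]), ('b', [2])]
total = rdd.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)  # 11

print(summed, grouped, total)
sc.stop()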
Python For Data Science Cheat Sheet: Scikit-Learn
Learn Python for data science Interactively at www.DataCamp.com

Scikit-learn
Scikit-learn is an open source Python library that implements a range of
machine learning, preprocessing, cross-validation and visualization algorithms
using a unified interface.

A Basic Example
>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)

Loading The Data                                       Also see NumPy & Pandas
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse
matrices. Other types that are convertible to numeric arrays, such as a Pandas
DataFrame, are also acceptable.
>>> import numpy as np
>>> X = np.random.random((10,5))
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> X[X < 0.7] = 0

Training And Test Data
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Preprocessing The Data
Standardization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)
Normalization
>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)
Binarization
>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)
Encoding Categorical Features
>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)
Imputing Missing Values
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values=0, strategy='mean', axis=0)
>>> imp.fit_transform(X_train)
Generating Polynomial Features
>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.fit_transform(X)

Create Your Model
Supervised Learning Estimators
Linear Regression
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression(normalize=True)
Support Vector Machines (SVM)
>>> from sklearn.svm import SVC
>>> svc = SVC(kernel='linear')
Naive Bayes
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()
KNN
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
Unsupervised Learning Estimators
Principal Component Analysis (PCA)
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=0.95)
K Means
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=3, random_state=0)

Model Fitting
Supervised learning
>>> lr.fit(X, y)                         Fit the model to the data
>>> knn.fit(X_train, y_train)
>>> svc.fit(X_train, y_train)
Unsupervised learning
>>> k_means.fit(X_train)                 Fit the model to the data
>>> pca_model = pca.fit_transform(X_train)   Fit to data, then transform it

Prediction
Supervised Estimators
>>> y_pred = svc.predict(np.random.random((2,5)))   Predict labels
>>> y_pred = lr.predict(X_test)                     Predict labels
>>> y_pred = knn.predict_proba(X_test)              Estimate probability of a label
Unsupervised Estimators
>>> y_pred = k_means.predict(X_test)                Predict labels in clustering algorithms

Evaluate Your Model's Performance
Classification Metrics
Accuracy Score
>>> knn.score(X_test, y_test)                       Estimator score method
>>> from sklearn.metrics import accuracy_score      Metric scoring functions
>>> accuracy_score(y_test, y_pred)
Classification Report
>>> from sklearn.metrics import classification_report   Precision, recall, f1-score and support
>>> print(classification_report(y_test, y_pred))
Confusion Matrix
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))
Regression Metrics
Mean Absolute Error
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> mean_absolute_error(y_true, y_pred)
Mean Squared Error
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_pred)
R² Score
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
Clustering Metrics
Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)
Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)
V-measure
>>> from sklearn.metrics import v_measure_score
>>> metrics.v_measure_score(y_true, y_pred)
Cross-Validation
(In current scikit-learn, cross_val_score, GridSearchCV and RandomizedSearchCV
live in sklearn.model_selection.)
>>> from sklearn.cross_validation import cross_val_score
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))

Tune Your Model
Grid Search
>>> from sklearn.grid_search import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3),
              "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn, param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)
Randomized Parameter Optimization
>>> from sklearn.grid_search import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
              "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn,
                                 param_distributions=params,
                                 cv=4,
                                 n_iter=8,
                                 random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
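A sketch that chains the preprocessing, estimator and grid-search pieces above into one pipeline; it uses the current sklearn.model_selection module rather than the deprecated sklearn.grid_search / sklearn.cross_validation imports shown in this sheet:

from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Load data and split, as in the basic example above
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# Chain scaling and the estimator so scaling is fit on training folds only
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Small grid over the classifier's n_neighbors
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": [3, 5, 7]}, cv=4)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))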
Python For Data Science Cheat Sheet: Keras
Learn Python for data science Interactively at www.DataCamp.com

Keras
Keras is a powerful and easy-to-use deep learning library for Theano and
TensorFlow that provides a high-level neural networks API to develop and
evaluate deep learning models.

A Basic Example
>>> import numpy as np
>>> from keras.models import Sequential
>>> from keras.layers import Dense
>>> data = np.random.random((1000,100))
>>> labels = np.random.randint(2,size=(1000,1))
>>> model = Sequential()
>>> model.add(Dense(32,
                    activation='relu',
                    input_dim=100))
>>> model.add(Dense(1, activation='sigmoid'))
>>> model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
>>> model.fit(data,labels,epochs=10,batch_size=32)
>>> predictions = model.predict(data)

Data                                                   Also see NumPy, Pandas & Scikit-Learn
Your data needs to be stored as NumPy arrays or as a list of NumPy arrays.
Ideally, you split the data in training and test sets, for which you can also
resort to the train_test_split function of sklearn (sklearn.model_selection in
current releases).
Keras Data Sets
>>> from keras.datasets import boston_housing, mnist, cifar10, imdb
>>> (x_train,y_train),(x_test,y_test) = mnist.load_data()
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
>>> (x_train3,y_train3),(x_test3,y_test3) = cifar10.load_data()
>>> (x_train4,y_train4),(x_test4,y_test4) = imdb.load_data(num_words=20000)
>>> num_classes = 10
Other
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"), delimiter=",")
>>> X = data[:,0:8]
>>> y = data[:,8]

Preprocessing                                          Also see NumPy & Scikit-Learn
Sequence Padding
>>> from keras.preprocessing import sequence
>>> x_train4 = sequence.pad_sequences(x_train4,maxlen=80)
>>> x_test4 = sequence.pad_sequences(x_test4,maxlen=80)
One-Hot Encoding
>>> from keras.utils import to_categorical
>>> Y_train = to_categorical(y_train, num_classes)
>>> Y_test = to_categorical(y_test, num_classes)
>>> Y_train3 = to_categorical(y_train3, num_classes)
>>> Y_test3 = to_categorical(y_test3, num_classes)
Train and Test Sets
>>> from sklearn.model_selection import train_test_split
>>> X_train5,X_test5,y_train5,y_test5 = train_test_split(X,
                                                         y,
                                                         test_size=0.33,
                                                         random_state=42)
Standardization/Normalization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(x_train2)
>>> standardized_X = scaler.transform(x_train2)
>>> standardized_X_test = scaler.transform(x_test2)

Model Architecture
Sequential Model
>>> from keras.models import Sequential
>>> model = Sequential()
>>> model2 = Sequential()
>>> model3 = Sequential()
Multilayer Perceptron (MLP)
Binary Classification
>>> from keras.layers import Dense
>>> model.add(Dense(12,
                    input_dim=8,
                    kernel_initializer='uniform',
                    activation='relu'))
>>> model.add(Dense(8,kernel_initializer='uniform',activation='relu'))
>>> model.add(Dense(1,kernel_initializer='uniform',activation='sigmoid'))
Multi-Class Classification
>>> from keras.layers import Dropout
>>> model.add(Dense(512,activation='relu',input_shape=(784,)))
>>> model.add(Dropout(0.2))
>>> model.add(Dense(512,activation='relu'))
>>> model.add(Dropout(0.2))
>>> model.add(Dense(10,activation='softmax'))
Regression
>>> model.add(Dense(64,activation='relu',input_dim=train_data.shape[1]))
>>> model.add(Dense(1))
Convolutional Neural Network (CNN)
>>> from keras.layers import Activation,Conv2D,MaxPooling2D,Flatten
>>> model2.add(Conv2D(32,(3,3),padding='same',input_shape=x_train.shape[1:]))
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(32,(3,3)))
>>> model2.add(Activation('relu'))
>>> model2.add(MaxPooling2D(pool_size=(2,2)))
>>> model2.add(Dropout(0.25))
>>> model2.add(Conv2D(64,(3,3), padding='same'))
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(64,(3, 3)))
>>> model2.add(Activation('relu'))
>>> model2.add(MaxPooling2D(pool_size=(2,2)))
>>> model2.add(Dropout(0.25))
>>> model2.add(Flatten())
>>> model2.add(Dense(512))
>>> model2.add(Activation('relu'))
>>> model2.add(Dropout(0.5))
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))
Recurrent Neural Network (RNN)
>>> from keras.layers import Embedding,LSTM
>>> model3.add(Embedding(20000,128))
>>> model3.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2))
>>> model3.add(Dense(1,activation='sigmoid'))

Inspect Model
>>> model.output_shape                   Model output shape
>>> model.summary()                      Model summary representation
>>> model.get_config()                   Model configuration
>>> model.get_weights()                  List all weight tensors in the model

Compile Model
MLP: Binary Classification
>>> model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
MLP: Multi-Class Classification
>>> model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
MLP: Regression
>>> model.compile(optimizer='rmsprop',
                  loss='mse',
                  metrics=['mae'])
Recurrent Neural Network
>>> model3.compile(loss='binary_crossentropy',
                   optimizer='adam',
                   metrics=['accuracy'])

Model Training
>>> model3.fit(x_train4,
               y_train4,
               batch_size=32,
               epochs=15,
               verbose=1,
               validation_data=(x_test4,y_test4))

Evaluate Your Model's Performance
>>> score = model3.evaluate(x_test,
                            y_test,
                            batch_size=32)

Prediction
>>> model3.predict(x_test4, batch_size=32)
>>> model3.predict_classes(x_test4,batch_size=32)

Save/Reload Models
>>> from keras.models import load_model
>>> model3.save('model_file.h5')
>>> my_model = load_model('my_model.h5')

Model Fine-tuning
Optimization Parameters
>>> from keras.optimizers import RMSprop
>>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> model2.compile(loss='categorical_crossentropy',
                   optimizer=opt,
                   metrics=['accuracy'])
Early Stopping
>>> from keras.callbacks import EarlyStopping
>>> early_stopping_monitor = EarlyStopping(patience=2)
>>> model3.fit(x_train4,
               y_train4,
               batch_size=32,
               epochs=15,
               validation_data=(x_test4,y_test4),
               callbacks=[early_stopping_monitor])
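A minimal sketch of the save/reload cycle above on the synthetic binary-classification data from the basic example; the file name is arbitrary, and the standalone keras import is assumed (with TensorFlow 2 you would import tensorflow.keras instead):

import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense

# Synthetic binary-classification data, matching the basic example above
data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(data, labels, epochs=5, batch_size=32, verbose=0)

# Save the trained model to disk and reload it; the reloaded copy
# scores identically on the same data
model.save('model_file.h5')
restored = load_model('model_file.h5')
print(model.evaluate(data, labels, verbose=0))
print(restored.evaluate(data, labels, verbose=0))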
Python for Data Science Cheat Sheet: spaCy
Learn more Python for data science interactively at www.datacamp.com

About spaCy
spaCy is a free, open-source library for advanced Natural Language Processing
(NLP) in Python. It's designed specifically for production use and helps you
build applications that process and "understand" large volumes of text.
Documentation: spacy.io

$ pip install spacy

import spacy

Statistical models
Download statistical models
Predict part-of-speech tags, dependency labels, named entities and more.
See spacy.io/models for available models.
$ python -m spacy download en_core_web_sm
Check that your installed models are up to date
$ python -m spacy validate
Loading statistical models
import spacy
# Load the installed model "en_core_web_sm"
nlp = spacy.load("en_core_web_sm")

Documents and tokens
Processing text
Processing text with the nlp object returns a Doc object that holds all
information about the tokens, their linguistic features and their relationships.
doc = nlp("This is a text")
Accessing token attributes
doc = nlp("This is a text")
# Token texts
[token.text for token in doc]
# ['This', 'is', 'a', 'text']

Spans
Accessing spans
Span indices are exclusive. So doc[2:4] is a span starting at token 2,
up to – but not including! – token 4.
doc = nlp("This is a text")
span = doc[2:4]
span.text
# 'a text'
Creating a span manually
# Import the Span object
from spacy.tokens import Span
# Create a Doc object
doc = nlp("I live in New York")
# Span for "New York" with label GPE (geopolitical)
span = Span(doc, 3, 5, label="GPE")
span.text
# 'New York'

Linguistic features
Attributes return label IDs. For string labels, use the attributes with an
underscore. For example, token.pos_ .
Part-of-speech tags  PREDICTED BY STATISTICAL MODEL
doc = nlp("This is a text.")
# Coarse-grained part-of-speech tags
[token.pos_ for token in doc]
# ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']
# Fine-grained part-of-speech tags
[token.tag_ for token in doc]
# ['DT', 'VBZ', 'DT', 'NN', '.']
Syntactic dependencies  PREDICTED BY STATISTICAL MODEL
doc = nlp("This is a text.")
# Dependency labels
[token.dep_ for token in doc]
# ['nsubj', 'ROOT', 'det', 'attr', 'punct']
# Syntactic head token (governor)
[token.head.text for token in doc]
# ['is', 'is', 'text', 'is', 'is']
Named entities  PREDICTED BY STATISTICAL MODEL
doc = nlp("Larry Page founded Google")
# Text and label of named entity span
[(ent.text, ent.label_) for ent in doc.ents]
# [('Larry Page', 'PERSON'), ('Google', 'ORG')]

Syntax iterators
Sentences  USUALLY NEEDS THE DEPENDENCY PARSER
doc = nlp("This is a sentence. This is another one.")
# doc.sents is a generator that yields sentence spans
[sent.text for sent in doc.sents]
# ['This is a sentence.', 'This is another one.']
Base noun phrases  NEEDS THE TAGGER AND PARSER
doc = nlp("I have a red car")
# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]
# ['I', 'a red car']

Label explanations
spacy.explain("RB")
# 'adverb'
spacy.explain("GPE")
# 'Countries, cities, states'

Visualizing
If you're in a Jupyter notebook, use displacy.render . Otherwise, use
displacy.serve to start a web server and show the visualization in your browser.
from spacy import displacy
Visualize dependencies
doc = nlp("This is a sentence")
displacy.render(doc, style="dep")
Visualize named entities
doc = nlp("Larry Page founded Google")
displacy.render(doc, style="ent")

Word vectors and similarity
To use word vectors, you need to install the larger models ending in md or lg,
for example en_core_web_lg .
Comparing similarity
doc1 = nlp("I like cats")
doc2 = nlp("I like dogs")
# Compare 2 documents
doc1.similarity(doc2)
# Compare 2 tokens
doc1[2].similarity(doc2[2])
# Compare tokens and spans
doc1[0].similarity(doc2[1:3])
Accessing word vectors
# Vector as a numpy array
doc = nlp("I like cats")
doc[2].vector
# The L2 norm of the token's vector
doc[2].vector_norm

Pipeline components
Functions that take a Doc object, modify it and return it.
Text → tokenizer → tagger → parser → ner → ... → Doc
Pipeline information
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names
# ['tagger', 'parser', 'ner']
nlp.pipeline
# [('tagger', <spacy.pipeline.Tagger>),
#  ('parser', <spacy.pipeline.DependencyParser>),
#  ('ner', <spacy.pipeline.EntityRecognizer>)]
Custom components
# Function that modifies the doc and returns it
def custom_component(doc):
    print("Do something to the doc here!")
    return doc
# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)
Components can be added first, last (default), or before or after an existing component.

Extension attributes
Custom attributes that are registered on the global Doc, Token and Span classes
and become available as ._ .
from spacy.tokens import Doc, Token, Span
doc = nlp("The sky over New York is blue")
Attribute extensions  WITH DEFAULT VALUE
# Register custom attribute on Token class
Token.set_extension("is_color", default=False)
# Overwrite extension attribute with default value
doc[6]._.is_color = True
Property extensions  WITH GETTER & SETTER
# Register custom attribute on Doc class
get_reversed = lambda doc: doc.text[::-1]
Doc.set_extension("reversed", getter=get_reversed)
# Compute value of extension attribute with getter
doc._.reversed
# 'eulb si kroY weN revo yks ehT'
Method extensions  CALLABLE METHOD
# Register custom attribute on Span class
has_label = lambda span, label: span.label_ == label
Span.set_extension("has_label", method=has_label)
# Compute value of extension attribute with method
doc[3:5]._.has_label("GPE")
# True

Rule-based matching
Token patterns
# "love cats", "loving cats", "loved cats"
pattern1 = [{"LEMMA": "love"}, {"LOWER": "cats"}]
# "10 people", "twenty people"
pattern2 = [{"LIKE_NUM": True}, {"TEXT": "people"}]
# "book", "a cat", "the sea" (noun + optional article)
pattern3 = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
Operators and quantifiers
Can be added to a token dict as the "OP" key.
!   Negate pattern and match exactly 0 times.
?   Make pattern optional and match 0 or 1 times.
+   Require pattern to match 1 or more times.
*   Allow pattern to match 0 or more times.
Using the matcher
# Matcher is initialized with the shared vocab
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# Each dict represents one token and its attributes
pattern = [{"LOWER": "new"}, {"LOWER": "york"}]
# Add with ID, optional callback and pattern(s)
matcher.add("CITIES", None, pattern)
# Match by calling the matcher on a Doc object
doc = nlp("I live in New York")
matches = matcher(doc)
# Matches are (match_id, start, end) tuples
for match_id, start, end in matches:
    # Get the matched span by slicing the Doc
    span = doc[start:end]
    print(span.text)
# 'New York'

Glossary
Tokenization: segmenting text into words, punctuation etc.
Lemmatization: assigning the base forms of words, for example "was" → "be" or "rats" → "rat".
Sentence Boundary Detection: finding and segmenting individual sentences.
Part-of-speech (POS) Tagging: assigning word types to tokens, like verb or noun.
Dependency Parsing: assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
Named Entity Recognition (NER): labeling named "real-world" objects, like persons, companies or locations.
Text Classification: assigning categories or labels to a whole document, or parts of a document.
Statistical model: process for making predictions based on examples.
Training: updating a statistical model with new examples.
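A small sketch combining a token pattern with the Matcher usage above; it follows the v2-style matcher.add signature used in this sheet (spaCy v3 takes a list of patterns instead of None) and assumes en_core_web_sm is installed:

import spacy
from spacy.matcher import Matcher

# Assumes the small English model has been downloaded:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Token pattern: an optional determiner followed by a noun
pattern = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
matcher.add("NOUN_PHRASE", None, pattern)

doc = nlp("I have a red car and the sea is blue")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)   # e.g. 'car', 'the sea'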
©2012-2015 - Laurent Pointal Mémento v2.0.6
License Creative Commons Attribution 4 Python 3 Cheat Sheet La
ht
t
p
t
e
s
s
t
:
/
v
/
p
e
e
r
s
r
s
i
o
o
no
.
li
n:
msi
.
fr/
poi
nt
al
/p
yth
on:
meme
nto
i
nt
ege
r,flo
at,b
ool
ean
,st
ri
ng,b
yte
s Base Types ◾ ordered sequences, fast index access, repeatable values Container Types
int783 0 -192 0b010 0o642 0xF3 list [1,5,9] ["x",11,8.9] ["mot"] []
z
ero b
ina
ry o
cta
l h
exa tuple (1,5,9) 11,"y",7.4 ("mot",) ()
float9.23 0.0 -1.7e-6
-6 No
nmo
difia
blev
alu
es(
immu
tabl
es) ☝e xp
ress
ionwit
hon
lycoma
s"tuple
boolTrue False ×10
""
str bytes ( o
rdere
dse
que
ncesofcha
rs/b
yte
s)
str"One\nTwo" Mu lti l
i
nes
tr
ing
: b""
e
sca
pedn
ewl
i
ne """X\tY\tZ ◾ key containers, no ap
rio
riorder, fast key access, each key is unique
'I\'m' 1\t2\t3""" dictionary dict {"key":"value"} dict(a=3,b=4,k="v") {}
e
sca
ped' e
sca
pedt
ab (
key
/va
luea
sso
cia
ti
ons
){1:"one",3:"three",2:"two",3.14:"2"}
bytesb"toto\xfe\775" collection set {"key1","key2"} {1,9,3,0} set()
h
exa
dec
ima
loc
tal ☝i
mmu
tab
les ☝k
eys
=ha
sha
blev
alu
es(
bas
ety
pes
,immu
tab
les
…) frozenseti
mmu
tab
les
et e
mpt
y

f
orvari
abl
es
,fun
cti
ons
, Identiers type(e xp
res si
on ) Conversions
modu
les
,cl
ass
es…names
int("15") → 15
int("3f",16) → 63 nd
can specify integer number base in 2 parameter
a…zA…Z_followed by a…zA…Z_0…9
◽ diacritics allowed but should be avoided int(15.56) → 15 truncate decimal part
◽ language keywords forbidden float("-11.24e8") → -1124000000.0
◽ lower/UPPER case discrimination round(15.56,1)→ 15.6 rounding to 1 decimal (0 decimal → integer number)
a toto x7 y_max BigOne bool(x) Falsefor null x, empty container x, Noneor Falsex ; Truefor other x
⇥ 8y and for str(x)→ "…" representation string of xfor display ( cf.f orma tt
in gont
heb a
ck)
chr(64)→'@' ord('@')→64 code ↔ char
= Variables assignment
repr(x)→ "…" l it
er
a l
 representation string of x
☝ assignment ⇔ binding of a namewith a v
alu
e
1)e va l
uati
onofr i
ghtsideex pre
ssio
nva l
u e bytes([72,9,64]) → b'H\t@'
2)a ssignmenti
no rderwithl eft
sidena
me s list("abc") → ['a','b','c']
x=1.2+8+sin(y) dict([(3,"three"),(1,"one")]) → {1:'one',3:'three'}
a=b=c=0 as s
ignme ntt
osamev a
lue set(["one","two"]) → {'one','two'}
y,z,r=9.2,-7.6,0 mul t
ipl
eass
ign
ments separat
orstrand s equenceo fstr→ a ssemb ledstr
a,b=b,a valu esswap ':'.join(['toto','12','pswd']) → 'toto:12:pswd'
a,*b=seq unpacki ngofseq
uencei
n strs plit
tedo nwh it
espaces→ listof str
*a,b=seq i tema n
dl i
st "words with spaces".split() → ['words','with','spaces']
and strs plit
tedo ns eparatorstr→ listo fstr
x+=3 i ncre
me nt% x=x+3 *=
x-=2 decr eme nt% x=x-2 /= "1,4,8,2".split(",") → ['1','4','8','2']
x=None «undefin
e d»const
antval
ue %= sequence of one type → listof another type ( vi
al istcomp rehe nsion )
del x r
emo
ven
amex … [int(x) for x in ('1','29','-3')] → [1,29,-3]
f
orl
i
sts
,tu
ples
,st
ri
ngs
,by
te s
… Sequence Containers Indexing
n
egat
iv
ein
dex -5 -4 -3 -2 -1 Items count Individual access to items via lst[i
nde
x]
pos
it
iv
einde
x 0 1 2 3 4 len(lst)→5 lst[0]→10 - fir
stone lst[1]→20
lst=[10, 20, 30, 40, 50] lst[-1]→50 -l asto
ne lst[-2]→40
pos
it
iv
esl
ic
e 0 1 2 3 4 5 ☝ index from 0
Onmutab
les
equ
ence
s(list)
,re
movewit
h
n
egat
iv
esl
i
ce -5 -4 -3 -2 -1 (here from 0 to 4)
del lst[3]andmodi
fywi
thass
ign
ment
lst[4]=25
Access to sub-sequences via lst[s
tar
tsl
i
ce:e
nds
li
ce:s
te
p]
lst[:-1]C[10,20,30,40] lst[::-1]C[50,40,30,20,10] lst[1:3]C[20,30] lst[:3]C[10,20,30]
lst[1:-1]C[20,30,40] lst[::-2]C[50,30,10] lst[-3:-1]C[30,40] lst[3:]C[40,50]
lst[::2]C[10,30,50] lst[:]C[10,20,30,40,50]sh
all
owco
pyofs
equ
enc
e
Mis
si
ngsl
icei
ndi
cat
ion" fr
oms t
art
/upt
oen
d.
Onmuta
blese
quenc
es(list),r
emovewi
t
hdel lst[3:5]a
ndmo
dif
ywi
t
has
si
gnme
ntlst[1:4]=[15,25]

Modules/Names Imports
from monmod import nom1, nom2 as fct     direct access to names, renaming with "as"
import monmod                            access via monmod.nom1 …
☝ modules and packages are searched in the python path (cf. sys.path)
module truc ⇔ file truc.py

Boolean Logic
Comparisons: < > <= >= == !=  (boolean results)
a and b     logical and: both simultaneously
a or b      logical or: one or the other or both
not a       logical not
True and False constants
☝ pitfall: and and or return the value of a or of b (under shortcut evaluation); ensure that a and b are booleans.
☝ with a var x:  if bool(x)==True: ⇔ if x:      if bool(x)==False: ⇔ if not x:
☝ floating numbers… approximated values

Statements Blocks
parent statement:
    statement block 1…            ← indentation!
    parent statement:
        statement block 2…
next statement after block 1
☝ configure the editor to insert 4 spaces in place of an indentation tab.

Conditional Statement
if logical condition:
    statements block
Statements block executed only if a condition is true. Can go with several elif, elif, … and only one final else. Only the block of the first true condition is executed.

if age <= 18:
    state = "Kid"
elif age > 65:
    state = "Retired"
else:
    state = "Active"

Maths
Operators: + - * / // % **        (usual order of operations; priority: (…), ** a power b, × ÷, + -)
// → integer division     % → division remainder     @ → matrix × (python 3.5+ numpy)
(1+5.3)*2 → 12.6     abs(-3.2) → 3.2     round(3.57,1) → 3.6     pow(4,3) → 64.0
from math import sin, pi …        (angles in radians)
sin(pi/4) → 0.707…     cos(2*pi/3) → -0.4999…     sqrt(81) → 9.0
log(e**2) → 2.0        ceil(12.5) → 13            floor(12.5) → 12
modules math, statistics, random, decimal, fractions, numpy, etc. (cf. doc)

Exceptions on Errors
Signaling an error:  raise ExcClass(…)
Errors processing — an error raised in the normal processing block jumps to the matching except block:
try:
    normal processing block
except ExcClass as e:
    error processing block
☝ finally block for final processing in all cases.
37

Conditional Loop Statement
while logical condition:
    statements block
Statements block executed as long as the condition is true.

s = 0                      # initializations before the loop
i = 1                      # condition with at least one variable value (here i)
while i <= 100:            # Algo: s = ∑ i² for i from 1 to 100
    s = s + i**2
    i = i + 1              # ☝ make the condition variable change!
print("sum:", s)

Iterative Loop Statement
for var in sequence:
    statements block
Statements block executed for each item of a container or iterator.
☝ good habit: don't modify the loop variable (its assignment is managed by the for statement).

Loop Control
break      immediate exit
continue   next iteration
☝ else block for normal loop exit.

Go over sequence's values:
s = "Some text"            # initializations before the loop
cnt = 0
for c in s:                # loop variable, assignment managed by the for statement
    if c == "e":
        cnt = cnt + 1      # Algo: count the number of 'e' in the string
print("found", cnt, "'e'")

Go over sequence's index:
◽ modify the item at an index
◽ access items around an index (before / after)
lst = [11, 18, 9, 12, 23, 4, 17]
lost = []
for idx in range(len(lst)):    # Algo: limit values greater than 15, memorizing the lost values
    val = lst[idx]
    if val > 15:
        lost.append(val)
        lst[idx] = 15
print("modif:", lst, "-lost:", lost)

Go simultaneously over sequence's index and values:
for idx, val in enumerate(lst):

Loop on dict/set ⇔ loop on keys. Use slices to loop on a subset of a sequence.

Display
print("v=", 3, "cm :", x, ",", y+4)
Items to display: literal values, variables, expressions.
print options:
◽ sep=" "          items separator, default space
◽ end="\n"         end of print, default new line
◽ file=sys.stdout  print to file, default standard output

Input
s = input("Instructions:")
☝ input always returns a string; convert it to the required type (cf. boxed Conversions on the other side).
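A minimal runnable sketch tying the input-conversion, print-option and error-handling notes above together (the prompt text is only an example):

# Read a number from the user; input() always returns a str, so convert it.
# A bad entry raises ValueError, handled with try/except as described above.
try:
    age = int(input("How old are you? "))
except ValueError as e:
    print("error processing:", e)
else:
    print("age", age, sep="=", end=" !\n")   # print options: sep and end
finally:
    print("final processing, runs in all cases")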
Generic Operations on Containers
len(c) → items count
min(c)   max(c)   sum(c)
sorted(c) → list sorted copy
val in c → boolean, membership operator in (absence: not in)
enumerate(c) → iterator on (index, value)
zip(c1, c2, …) → iterator on tuples containing ci items at same index
all(c) → True if all c items evaluated to true, else False
any(c) → True if at least one item of c evaluated true, else False
Note: for dictionaries and sets, these operations use keys.
Specific to ordered sequence containers (lists, tuples, strings, bytes…):
reversed(c) → inverse iterator        c*5 → duplicate        c+c2 → concatenate
c.index(val) → position               c.count(val) → events count
import copy
copy.copy(c) → shallow copy of container
copy.deepcopy(c) → deep copy of container

Integer Sequences
range([start,] end [, step])
☝ start default 0, end not included in the sequence, step signed, default 1
range(5) → 0 1 2 3 4          range(2,12,3) → 2 5 8 11
range(3,8) → 3 4 5 6 7        range(20,5,-5) → 20 15 10
range(len(seq)) → sequence of the indexes of the values in seq
☝ range provides an immutable sequence of int, constructed as needed

Operations on Lists (☝ these modify the original list)
lst.append(val)             add item at end
lst.extend(seq)             add sequence of items at end
lst.insert(idx, val)        insert item at index
lst.remove(val)             remove first item with value val
lst.pop([idx]) → value      remove & return item at index idx (default last)
lst.sort()   lst.reverse()  sort / reverse list in place

Function Definition
def fct(x, y, z):          # fct: function name (identifier); x, y, z: named parameters
    """documentation"""
    # statements block, res computation, etc.
    return res             # result value of the call; if there is no computed result to return: return None
☝ parameters and all variables of this block exist only in the block and during the function call (think of a "black box")
Advanced:
def fct(x, y, z, *args, a=3, b=5, **kwargs):
*args variable positional arguments (→ tuple), default values, **kwargs variable named arguments (→ dict)
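A short sketch of the generic container operations and of an *args/**kwargs signature like the one above (the function body is illustrative only):

vals = [3, 1, 4, 1, 5]
print(len(vals), min(vals), max(vals), sum(vals))     # 5 1 5 14
print(sorted(vals), 4 in vals)                        # [1, 1, 3, 4, 5] True
print(list(enumerate("ab")), list(zip(vals, "abc")))  # [(0, 'a'), (1, 'b')] [(3, 'a'), (1, 'b'), (4, 'c')]
print(all(vals), any([0, 0, 1]))                      # True True

def fct(x, y, z, *args, a=3, b=5, **kwargs):
    """Collect extra positional args in a tuple and extra named args in a dict."""
    return x + y + z + sum(args) + a + b + sum(kwargs.values())

print(fct(1, 2, 3, 10, 20, a=0, extra=100))           # 141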
Function Call
r = fct(3, i+2, 2*i)
☝ this is the use of function fct() — the name with parentheses does the call; one argument per parameter; storage/use of the returned value.
Advanced:  *sequence   **dict

Operations on Strings
s.startswith(prefix[, start[, end]])     s.endswith(suffix[, start[, end]])
s.strip([chars])                         s.count(sub[, start[, end]])
s.partition(sep) → (before, sep, after)
s.index(sub[, start[, end]])             s.find(sub[, start[, end]])
s.is…()   tests on chars categories (ex. s.isalpha())
s.upper()   s.lower()   s.title()   s.swapcase()
s.casefold()   s.capitalize()   s.center([width, fill])
s.ljust([width, fill])   s.rjust([width, fill])   s.zfill([width])
s.encode(encoding)   s.split([sep])   s.join(seq)

Operations on Dictionaries
d[key] = value              d[key] → value
d.clear()                   del d[key]
d.update(d2)                update/add associations
d.keys()   d.values()   d.items()   → iterable views on keys / values / associations
d.pop(key[, default]) → value
d.popitem() → (key, value)
d.get(key[, default]) → value
d.setdefault(key[, default]) → value

Operations on Sets
Operators:
|    → union (vertical bar char)
&    → intersection
-  ^ → difference / symmetric difference
<  <=  >  >=  → inclusion relations
Operators also exist as methods.
s.update(s2)    s.copy()
s.add(key)      s.remove(key)
s.discard(key)  s.clear()
s.pop()
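A short sketch of the dictionary and set operations listed above (the values are illustrative only):

d = {'a': 1, 'b': 2}
d['c'] = 3                       # add / update an association
print(d.get('z', 0))             # 0  (default when the key is absent)
print(d.pop('a'))                # 1  (remove and return the value)
print(list(d.keys()), list(d.items()))

s1 = {1, 2, 3}
s2 = {3, 4}
print(s1 | s2, s1 & s2)          # union, intersection
print(s1 - s2, s1 ^ s2)          # difference, symmetric difference
print({1, 2} <= s1)              # True: inclusion relation
s1.add(5)
s1.discard(42)                   # discard() ignores a missing key, remove() would raise KeyError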
Files — storing data on disk, and reading it back
f = open("file.txt", "w", encoding="utf8")
    f: file variable for operations     "file.txt": name of the file on disk (+ path…)
    opening mode: ◽ 'r' read  ◽ 'w' write  ◽ 'a' append  ◽ … '+' 'x' 'b' 't'
    encoding of chars for text files: utf8, ascii, latin1, …
cf. modules os, os.path and pathlib
Writing:                             Reading:
f.write("coucou")                    f.read([n]) → next chars (if n is not specified, read up to end!)
f.writelines(list of lines)          f.readlines([n]) → list of next lines
                                     f.readline() → next line
☝ text mode t by default (read/write str), possible binary mode b (read/write bytes). Convert from/to the required type!
☝ read returns an empty string at end of file.
f.close()   ☝ don't forget to close the file after use!
f.flush() write cache       f.truncate([size]) resize
Reading/writing progresses sequentially in the file, modifiable with:
f.tell() → position         f.seek(position[, origin])
Very common: opening with a guarded block (automatic closing) and a reading loop on the lines of a text file:
with open(…) as f:
    for line in f:
        # processing of line

Formatting
Formatting directives applied to the values to format:
"modele{} {} {}".format(x, y, r) → str
"{selection:formatting!conversion}"
◽ Selection: nothing, 2 (argument index), nom (argument name), 0.nom, 4[key], 0[2]
◽ Formatting: fill char, alignment (< > ^ =), sign (+ - space), minimum width, .precision ~ max width, type
    0 at start for filling with 0
    integer types: b binary, c char, d decimal (default), o octal, x or X hexa…
    float types: e or E exponential, f or F fixed point, g or G appropriate (default)
    string: s…     percent: %
◽ Conversion: s (readable text) or r (literal representation)
Examples:
"{:+2.3f}".format(45.72793)   → '+45.728'
"{1:>10s}".format(8, "toto")  → '      toto'
"{x!r}".format(x="I'm")       → '"I\'m"'
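A compact, runnable sketch of the file and format() notes above (the file name demo.txt is only an example):

with open("demo.txt", "w", encoding="utf8") as f:     # guarded block: automatic close
    f.write("coucou\n")
    f.writelines(["line 2\n", "line 3\n"])

with open("demo.txt", encoding="utf8") as f:
    for line in f:                                    # reading loop on lines
        print(line.rstrip())

print("{:+2.3f}".format(45.72793))     # '+45.728'
print("{1:>10s}".format(8, "toto"))    # '      toto'  (selection by index, right-aligned, width 10)
print("{x!r}".format(x="I'm"))         # '"I\'m"'      (r conversion: literal representation)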
Dictionaries store connections between pieces of
List comprehensions information. Each item in a dictionary is a key-value pair.
squares = [x**2 for x in range(1, 11)] A simple dictionary
Slicing a list alien = {'color': 'green', 'points': 5}
finishers = ['sam', 'bob', 'ada', 'bea']
Accessing a value
first_two = finishers[:2]
print("The alien's color is " + alien['color'])
Variables are used to store values. A string is a series of Copying a list
characters, surrounded by single or double quotes. Adding a new key-value pair
copy_of_bikes = bikes[:]
Hello world alien['x_position'] = 0
print("Hello world!") Looping through all key-value pairs
Hello world with a variable Tuples are similar to lists, but the items in a tuple can't be fav_numbers = {'eric': 17, 'ever': 4}
modified. for name, number in fav_numbers.items():
msg = "Hello world!"
print(name + ' loves ' + str(number))
print(msg) Making a tuple
Concatenation (combining strings) dimensions = (1920, 1080)
Looping through all keys
fav_numbers = {'eric': 17, 'ever': 4}
first_name = 'albert'
for name in fav_numbers.keys():
last_name = 'einstein'
print(name + ' loves a number')
full_name = first_name + ' ' + last_name If statements are used to test for particular conditions and
print(full_name) respond appropriately. Looping through all the values
Conditional tests fav_numbers = {'eric': 17, 'ever': 4}
for number in fav_numbers.values():
equals x == 42 print(str(number) + ' is a favorite')
A list stores a series of items in a particular order. You
not equal x != 42
access items using an index, or within a loop.
greater than x > 42
Make a list or equal to x >= 42
less than x < 42 Your programs can prompt the user for input. All input is
bikes = ['trek', 'redline', 'giant'] or equal to x <= 42 stored as a string.
Get the first item in a list Conditional test with lists Prompting for a value
first_bike = bikes[0] 'trek' in bikes name = input("What's your name? ")
Get the last item in a list 'surly' not in bikes print("Hello, " + name + "!")
last_bike = bikes[-1] Assigning boolean values Prompting for numerical input
Looping through a list game_active = True age = input("How old are you? ")
can_edit = False age = int(age)
for bike in bikes:
print(bike) A simple if test
pi = input("What's the value of pi? ")
Adding items to a list if age >= 18: pi = float(pi)
print("You can vote!")
bikes = []
bikes.append('trek') If-elif-else statements
bikes.append('redline') if age < 4:
bikes.append('giant') ticket_price = 0
Making numerical lists elif age < 18: Covers Python 3 and Python 2
ticket_price = 10
squares = [] else:
for x in range(1, 11): ticket_price = 15
squares.append(x**2)
38
A while loop repeats a block of code as long as a certain A class defines the behavior of an object and the kind of Your programs can read from files and write to files. Files
condition is true. information an object can store. The information in a class are opened in read mode ('r') by default, but can also be
is stored in attributes, and functions that belong to a class opened in write mode ('w') and append mode ('a').
A simple while loop are called methods. A child class inherits the attributes and
methods from its parent class. Reading a file and storing its lines
current_value = 1
while current_value <= 5: filename = 'siddhartha.txt'
Creating a dog class
print(current_value) with open(filename) as file_object:
current_value += 1 class Dog(): lines = file_object.readlines()
"""Represent a dog."""
Letting the user choose when to quit for line in lines:
msg = '' def __init__(self, name): print(line)
while msg != 'quit': """Initialize dog object."""
self.name = name Writing to a file
msg = input("What's your message? ")
print(msg) filename = 'journal.txt'
def sit(self): with open(filename, 'w') as file_object:
"""Simulate sitting.""" file_object.write("I love programming.")
print(self.name + " is sitting.")
Functions are named blocks of code, designed to do one Appending to a file
specific job. Information passed to a function is called an my_dog = Dog('Peso')
argument, and information received by a function is called a filename = 'journal.txt'
parameter. with open(filename, 'a') as file_object:
print(my_dog.name + " is a great dog!") file_object.write("\nI love making games.")
A simple function my_dog.sit()

def greet_user(): Inheritance


"""Display a simple greeting.""" class SARDog(Dog): Exceptions help you respond appropriately to errors that
print("Hello!") """Represent a search dog.""" are likely to occur. You place code that might cause an
error in the try block. Code that should run in response to
greet_user() def __init__(self, name): an error goes in the except block. Code that should run only
Passing an argument """Initialize the sardog.""" if the try block was successful goes in the else block.
super().__init__(name)
def greet_user(username): Catching an exception
"""Display a personalized greeting.""" def search(self): prompt = "How many tickets do you need? "
print("Hello, " + username + "!") """Simulate searching.""" num_tickets = input(prompt)
print(self.name + " is searching.")
greet_user('jesse') try:
my_dog = SARDog('Willie') num_tickets = int(num_tickets)
Default values for parameters
except ValueError:
def make_pizza(topping='bacon'): print(my_dog.name + " is a search dog.") print("Please try again.")
"""Make a single-topping pizza.""" my_dog.sit() else:
print("Have a " + topping + " pizza!") my_dog.search() print("Your tickets are printing.")

make_pizza()
make_pizza('pepperoni')
If you had infinite programming skills, what would you Simple is better than complex
Returning a value build?
If you have a choice between a simple and a complex
def add_numbers(x, y): As you're learning to program, it's helpful to think solution, and both work, use the simple solution. Your
"""Add two numbers and return the sum.""" about the real-world projects you'd like to create. It's
return x + y code will be easier to maintain, and it will be easier
a good habit to keep an "ideas" notebook that you for you and others to build on that code later on.
can refer to whenever you want to start a new project.
sum = add_numbers(3, 5)
print(sum)
If you haven't done so already, take a few minutes
and describe three projects you'd like to create.
39
You can add elements to the end of a list, or you can insert The sort() method changes the order of a list permanently.
them wherever you like in a list. The sorted() function returns a copy of the list, leaving the
original list unchanged. You can sort the items in a list in
Adding an element to the end of the list alphabetical order, or reverse alphabetical order. You can
users.append('amy') also reverse the original order of the list. Keep in mind that
lowercase and uppercase letters may affect the sort order.
Starting with an empty list
Sorting a list permanently
users = []
A list stores a series of items in a particular order. users.append('val') users.sort()
Lists allow you to store sets of information in one users.append('bob') Sorting a list permanently in reverse alphabetical
place, whether you have just a few items or millions users.append('mia')
order
of items. Lists are one of Python's most powerful Inserting elements at a particular position
features readily accessible to new programmers, and users.sort(reverse=True)
they tie together many important concepts in users.insert(0, 'joe')
Sorting a list temporarily
programming. users.insert(3, 'bea')
print(sorted(users))
print(sorted(users, reverse=True))
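Since uppercase letters sort before lowercase ones, a quick sketch of that effect (the names are illustrative, and the key argument of sorted() is not covered on this sheet):
users = ['ada', 'Bob', 'mia']
print(sorted(users))                  # ['Bob', 'ada', 'mia'] - uppercase sorts first
print(sorted(users, key=str.lower))   # ['ada', 'Bob', 'mia'] - case-insensitive ordering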

You can remove elements by their position in a list, or by Reversing the order of a list
Use square brackets to define a list, and use commas to
separate individual items in the list. Use plural names for the value of the item. If you remove an item by its value, users.reverse()
lists, to make your code easier to read. Python removes only the first item that has that value.

Making a list Deleting an element by its position

users = ['val', 'bob', 'mia', 'ron', 'ned'] del users[-1] Lists can contain millions of items, so Python provides an
efficient way to loop through all the items in a list. When
Removing an item by its value you set up a loop, Python pulls each item from the list one
users.remove('mia') at a time and stores it in a temporary variable, which you
Individual elements in a list are accessed according to their provide a name for. This name should be the singular
position, called the index. The index of the first element is version of the list name.
0, the index of the second element is 1, and so forth. The indented block of code makes up the body of the
Negative indices refer to items at the end of the list. To get If you want to work with an element that you're removing loop, where you can work with each individual item. Any
a particular element, write the name of the list and then the from the list, you can "pop" the element. If you think of the lines that are not indented run after the loop is completed.
index of the element in square brackets. list as a stack of items, pop() takes an item off the top of the
stack. By default pop() returns the last element in the list,
Printing all items in a list
Getting the first element but you can also pop elements from any position in the list. for user in users:
first_user = users[0] print(user)
Pop the last item from a list
Getting the second element most_recent_user = users.pop() Printing a message for each item, and a separate
print(most_recent_user) message afterwards
second_user = users[1]
for user in users:
Getting the last element Pop the first item in a list
print("Welcome, " + user + "!")
newest_user = users[-1] first_user = users.pop(0)
print(first_user) print("Welcome, we're glad to see you all!")

Once you've defined a list, you can change individual


elements in the list. You do this by referring to the index of The len() function returns the number of items in a list.
the item you want to modify.
Find the length of a list
Covers Python 3 and Python 2
Changing an element num_users = len(users)
users[0] = 'valerie' print("We have " + str(num_users) + " users.")
users[-2] = 'ronald'
40
You can use the range() function to work with a set of To copy a list make a slice that starts at the first item and A tuple is like a list, except you can't change the values in a
numbers efficiently. The range() function starts at 0 by ends at the last item. If you try to copy a list without using tuple once it's defined. Tuples are good for storing
default, and stops one number below the number passed to this approach, whatever you do to the copied list will affect information that shouldn't be changed throughout the life of
it. You can use the list() function to efficiently generate a the original list as well. a program. Tuples are designated by parentheses instead
large list of numbers. of square brackets. (You can overwrite an entire tuple, but
Making a copy of a list you can't change the individual elements in a tuple.)
Printing the numbers 0 to 1000
finishers = ['kai', 'abe', 'ada', 'gus', 'zoe'] Defining a tuple
for number in range(1001): copy_of_finishers = finishers[:]
print(number) dimensions = (800, 600)

Printing the numbers 1 to 1000 Looping through a tuple


for number in range(1, 1001): You can use a loop to generate a list based on a range of for dimension in dimensions:
print(number) numbers or on another list. This is a common operation, so print(dimension)
Python offers a more efficient way to do it. List
Making a list of numbers from 1 to a million comprehensions may look complicated at first; if so, use the Overwriting a tuple
for loop approach until you're ready to start using dimensions = (800, 600)
numbers = list(range(1, 1000001)) comprehensions. print(dimensions)
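range() also accepts a step value, as summarized earlier in this compilation; a small sketch:
even_numbers = list(range(2, 11, 2))
print(even_numbers)     # [2, 4, 6, 8, 10]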
To write a comprehension, define an expression for the
values you want to store in the list. Then write a for loop to
dimensions = (1200, 900)
generate input values needed to make the list.
There are a number of simple statistics you can run on a list
containing numerical data. Using a loop to generate a list of square numbers
Finding the minimum value in a list squares = [] When you're first learning about data structures such as
ages = [93, 99, 66, 17, 85, 1, 35, 82, 2, 77] for x in range(1, 11): lists, it helps to visualize how Python is working with the
youngest = min(ages) square = x**2 information in your program. pythontutor.com is a great tool
squares.append(square) for seeing how Python keeps track of the information in a
Finding the maximum value list. Try running the following code on pythontutor.com, and
Using a comprehension to generate a list of square
then run your own code.
ages = [93, 99, 66, 17, 85, 1, 35, 82, 2, 77] numbers
oldest = max(ages) Build a list and print the items in the list
squares = [x**2 for x in range(1, 11)]
Finding the sum of all values dogs = []
Using a loop to convert a list of names to upper case dogs.append('willie')
ages = [93, 99, 66, 17, 85, 1, 35, 82, 2, 77]
names = ['kai', 'abe', 'ada', 'gus', 'zoe'] dogs.append('hootz')
total_years = sum(ages)
dogs.append('peso')
upper_names = [] dogs.append('goblin')
for name in names:
You can work with any set of elements from a list. A portion upper_names.append(name.upper()) for dog in dogs:
of a list is called a slice. To slice a list start with the index of print("Hello " + dog + "!")
Using a comprehension to convert a list of names to print("I love these dogs!")
the first item you want, then add a colon and the index after
the last item you want. Leave off the first index to start at upper case
the beginning of the list, and leave off the last index to slice print("\nThese were my first two dogs:")
names = ['kai', 'abe', 'ada', 'gus', 'zoe']
through the end of the list. old_dogs = dogs[:2]
for old_dog in old_dogs:
Getting the first three items upper_names = [name.upper() for name in names]
print(old_dog)
finishers = ['kai', 'abe', 'ada', 'gus', 'zoe']
first_three = finishers[:3] del dogs[0]
Readability counts dogs.remove('peso')
Getting the middle three items print(dogs)
Use four spaces per indentation level.
middle_three = finishers[1:4]
Keep your lines to 79 characters or fewer.
Getting the last three items Use single blank lines to group parts of your
last_three = finishers[-3:] program visually.
41
You can store as many key-value pairs as you want in a You can loop through a dictionary in three ways: you can
dictionary, until your computer runs out of memory. To add loop through all the key-value pairs, all the keys, or all the
a new key-value pair to an existing dictionary give the name values.
of the dictionary and the new key in square brackets, and A dictionary only tracks the connections between keys
set it equal to the new value. and values; it doesn't track the order of items in the
This also allows you to start with an empty dictionary and dictionary. If you want to process the information in order,
add key-value pairs as they become relevant. you can sort the keys in your loop.

Adding a key-value pair Looping through all key-value pairs


alien_0 = {'color': 'green', 'points': 5} # Store people's favorite languages.
fav_languages = {
alien_0['x'] = 0 'jen': 'python',
Python's dictionaries allow you to connect pieces of
alien_0['y'] = 25 'sarah': 'c',
related information. Each piece of information in a 'edward': 'ruby',
alien_0['speed'] = 1.5
dictionary is stored as a key-value pair. When you 'phil': 'python',
provide a key, Python returns the value associated Adding to an empty dictionary }
with that key. You can loop through all the key-value alien_0 = {}
pairs, all the keys, or all the values. alien_0['color'] = 'green' # Show each person's favorite language.
alien_0['points'] = 5 for name, language in fav_languages.items():
print(name + ": " + language)

Use curly braces to define a dictionary. Use colons to Looping through all the keys
connect keys and values, and use commas to separate You can modify the value associated with any key in a # Show everyone who's taken the survey.
individual key-value pairs. dictionary. To do so give the name of the dictionary and for name in fav_languages.keys():
enclose the key in square brackets, then provide the new print(name)
Making a dictionary
value for that key.
alien_0 = {'color': 'green', 'points': 5} Looping through all the values
Modifying values in a dictionary
# Show all the languages that have been chosen.
alien_0 = {'color': 'green', 'points': 5} for language in fav_languages.values():
print(alien_0) print(language)
To access the value associated with an individual key give
the name of the dictionary and then place the key in a set of # Change the alien's color and point value. Looping through all the keys in order
square brackets. If the key you're asking for is not in the alien_0['color'] = 'yellow'
dictionary, an error will occur. # Show each person's favorite language,
alien_0['points'] = 10 # in order by the person's name.
You can also use the get() method, which returns None print(alien_0)
instead of an error if the key doesn't exist. You can also for name in sorted(fav_languages.keys()):
specify a default value to use if the key is not in the print(name + ": " + fav_languages[name])
dictionary.
Getting the value associated with a key You can remove any key-value pair you want from a
dictionary. To do so use the del keyword and the dictionary
You can find the number of key-value pairs in a dictionary.
alien_0 = {'color': 'green', 'points': 5} name, followed by the key in square brackets. This will
delete the key and its associated value. Finding a dictionary's length
print(alien_0['color'])
print(alien_0['points']) Deleting a key-value pair num_responses = len(fav_languages)
alien_0 = {'color': 'green', 'points': 5}
Getting the value with get()
print(alien_0)
alien_0 = {'color': 'green'}
del alien_0['points']
alien_color = alien_0.get('color') print(alien_0)
alien_points = alien_0.get('points', 0) Covers Python 3 and Python 2

print(alien_color)
print(alien_points) Try running some of these examples on pythontutor.com.
42
It's sometimes useful to store a set of dictionaries in a list; Storing a list inside a dictionary allows you to associate Standard Python dictionaries don't keep track of the order
this is called nesting. more than one value with each key. in which keys and values are added; they only preserve the
association between each key and its value. If you want to
Storing dictionaries in a list Storing lists in a dictionary preserve the order in which keys and values are added, use
# Start with an empty list. # Store multiple languages for each person. an OrderedDict.
users = [] fav_languages = { Preserving the order of keys and values
'jen': ['python', 'ruby'],
# Make a new user, and add them to the list. 'sarah': ['c'], from collections import OrderedDict
new_user = { 'edward': ['ruby', 'go'],
'last': 'fermi', 'phil': ['python', 'haskell'], # Store each person's languages, keeping
'first': 'enrico', } # track of who responded first.
'username': 'efermi', fav_languages = OrderedDict()
} # Show all responses for each person.
users.append(new_user) for name, langs in fav_languages.items(): fav_languages['jen'] = ['python', 'ruby']
print(name + ": ") fav_languages['sarah'] = ['c']
# Make another new user, and add them as well. for lang in langs: fav_languages['edward'] = ['ruby', 'go']
new_user = { print("- " + lang) fav_languages['phil'] = ['python', 'haskell']
'last': 'curie',
'first': 'marie', # Display the results, in the same order they
'username': 'mcurie', # were entered.
} You can store a dictionary inside another dictionary. In this for name, langs in fav_languages.items():
users.append(new_user) case each value associated with a key is itself a dictionary. print(name + ":")
for lang in langs:
Storing dictionaries in a dictionary print("- " + lang)
# Show all information about each user.
for user_dict in users: users = {
for k, v in user_dict.items(): 'aeinstein': {
print(k + ": " + v) 'first': 'albert',
'last': 'einstein', You can use a loop to generate a large number of
print("\n")
'location': 'princeton', dictionaries efficiently, if all the dictionaries start out with
You can also define a list of dictionaries directly, }, similar data.
without using append(): 'mcurie': { A million aliens
# Define a list of users, where each user 'first': 'marie',
'last': 'curie', aliens = []
# is represented by a dictionary.
users = [ 'location': 'paris',
}, # Make a million green aliens, worth 5 points
{ # each. Have them all start in one row.
'last': 'fermi', }
for alien_num in range(1000000):
'first': 'enrico', new_alien = {}
'username': 'efermi', for username, user_dict in users.items():
print("\nUsername: " + username) new_alien['color'] = 'green'
}, new_alien['points'] = 5
{ full_name = user_dict['first'] + " "
full_name += user_dict['last'] new_alien['x'] = 20 * alien_num
'last': 'curie', new_alien['y'] = 0
'first': 'marie', location = user_dict['location']
aliens.append(new_alien)
'username': 'mcurie',
}, print("\tFull name: " + full_name.title())
print("\tLocation: " + location.title()) # Prove the list contains a million aliens.
] num_aliens = len(aliens)
# Show all information about each user. print("Number of aliens created:")
for user_dict in users: Nesting is extremely useful in certain situations. However, print(num_aliens)
for k, v in user_dict.items(): be aware of making your code overly complex. If you're
print(k + ": " + v) nesting items much deeper than what you see here there
print("\n") are probably simpler ways of managing your data, such as
using classes.
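As a rough sketch of that suggestion, the nested users dictionary above could be rewritten around a small class (the class itself is not part of the original sheet; the data mirrors the example):
class User():
    """Store the same information as one entry of the nested dictionary."""
    def __init__(self, first, last, location):
        self.first = first
        self.last = last
        self.location = location

users = [
    User('albert', 'einstein', 'princeton'),
    User('marie', 'curie', 'paris'),
]

for user in users:
    print(user.first.title() + " " + user.last.title() + ": " + user.location.title())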
43
Testing numerical values is similar to testing string values. Several kinds of if statements exist. Your choice of which to
use depends on the number of conditions you need to test.
Testing equality and inequality You can have as many elif blocks as you need, and the
>>> age = 18 else block is always optional.
>>> age == 18 Simple if statement
True
>>> age != 18 age = 19
False
if age >= 18:
Comparison operators print("You're old enough to vote!")
>>> age = 19 If-else statements
>>> age < 21
True age = 17
>>> age <= 21
True if age >= 18:
If statements allow you to examine the current state >>> age > 21 print("You're old enough to vote!")
of a program and respond appropriately to that state. False else:
>>> age >= 21 print("You can't vote yet.")
You can write a simple if statement that checks one
False
condition, or you can create a complex series of if The if-elif-else chain
statements that identify the exact conditions you're
age = 12
looking for.
You can check multiple conditions at the same time. The if age < 4:
While loops run as long as certain conditions remain and operator returns True if all the conditions listed are price = 0
true. You can use while loops to let your programs True. The or operator returns True if any condition is True. elif age < 18:
run as long as your users want them to. Using and to check multiple conditions price = 5
else:
>>> age_0 = 22 price = 10
>>> age_1 = 18
A conditional test is an expression that can be evaluated as >>> age_0 >= 21 and age_1 >= 21 print("Your cost is $" + str(price) + ".")
True or False. Python uses the values True and False to False
decide whether the code in an if statement should be >>> age_1 = 23
executed. >>> age_0 >= 21 and age_1 >= 21
True You can easily test whether a certain value is in a list. You
Checking for equality
A single equal sign assigns a value to a variable. A double equal can also test whether a list is empty before trying to loop
Using or to check multiple conditions through the list.
sign (==) checks whether two values are equal.
>>> age_0 = 22 Testing if a value is in a list
>>> car = 'bmw'
>>> age_1 = 18
>>> car == 'bmw' >>> players = ['al', 'bea', 'cyn', 'dale']
>>> age_0 >= 21 or age_1 >= 21
True >>> 'al' in players
True
>>> car = 'audi' True
>>> age_0 = 18
>>> car == 'bmw' >>> 'eric' in players
>>> age_0 >= 21 or age_1 >= 21
False False
False
Ignoring case when making a comparison
>>> car = 'Audi'
>>> car.lower() == 'audi' A boolean value is either True or False. Variables with
True boolean values are often used to keep track of certain
conditions within a program.
Checking for inequality Covers Python 3 and Python 2
Simple boolean values
>>> topping = 'mushrooms'
>>> topping != 'anchovies' game_active = True
True can_edit = False
44
Testing if a value is not in a list Letting the user choose when to quit Using continue in a loop
banned_users = ['ann', 'chad', 'dee'] prompt = "\nTell me something, and I'll " banned_users = ['eve', 'fred', 'gary', 'helen']
user = 'erin' prompt += "repeat it back to you."
prompt += "\nEnter 'quit' to end the program. " prompt = "\nAdd a player to your team."
if user not in banned_users: prompt += "\nEnter 'quit' when you're done. "
print("You can play!") message = ""
while message != 'quit': players = []
Checking if a list is empty message = input(prompt) while True:
players = [] player = input(prompt)
if message != 'quit': if player == 'quit':
if players: print(message) break
for player in players: elif player in banned_users:
Using a flag print(player + " is banned!")
print("Player: " + player.title())
else: prompt = "\nTell me something, and I'll " continue
print("We have no players yet!") prompt += "repeat it back to you." else:
prompt += "\nEnter 'quit' to end the program. " players.append(player)

active = True print("\nYour team:")


You can allow your users to enter input using the input() while active: for player in players:
statement. In Python 3, all input is stored as a string. message = input(prompt) print(player)

Simple input
if message == 'quit':
name = input("What's your name? ") active = False Every while loop needs a way to stop running so it won't
print("Hello, " + name + ".") else: continue to run forever. If there's no way for the condition to
print(message) become False, the loop will never stop running.
Accepting numerical input
Using break to exit a loop An infinite loop
age = input("How old are you? ")
age = int(age) prompt = "\nWhat cities have you visited?" while True:
prompt += "\nEnter 'quit' when you're done. " name = input("\nWho are you? ")
if age >= 18: print("Nice to meet you, " + name + "!")
print("\nYou can vote!") while True:
else: city = input(prompt)
print("\nYou can't vote yet.")
if city == 'quit': The remove() method removes a specific value from a list,
Accepting input in Python 2.7 break
Use raw_input() in Python 2.7. This function interprets all input as a
but it only removes the first instance of the value you
string, just as input() does in Python 3.
else: provide. You can use a while loop to remove all instances
print("I've been to " + city + "!") of a particular value.
name = raw_input("What's your name? ")
print("Hello, " + name + ".") Removing all cats from a list of pets
pets = ['dog', 'cat', 'dog', 'fish', 'cat',
Sublime Text doesn't run programs that prompt the user for
'rabbit', 'cat']
input. You can use Sublime Text to write programs that
prompt for input, but you'll need to run these programs from print(pets)
A while loop repeats a block of code as long as a condition
is True. a terminal.
while 'cat' in pets:
Counting to 5 pets.remove('cat')

current_number = 1 print(pets)
You can use the break statement and the continue
statement with any of Python's loops. For example you can
while current_number <= 5: use break to quit a for loop that's working through a list or a
print(current_number) dictionary. You can use continue to skip over certain items
current_number += 1 when looping through a list or dictionary as well.
45
The two main kinds of arguments are positional and A function can return a value or a set of values. When a
keyword arguments. When you use positional arguments function returns a value, the calling line must provide a
Python matches the first argument in the function call with variable in which to store the return value. A function stops
the first parameter in the function definition, and so forth. running when it reaches a return statement.
With keyword arguments, you specify which parameter
each argument should be assigned to in the function call. Returning a single value
When you use keyword arguments, the order of the def get_full_name(first, last):
arguments doesn't matter. """Return a neatly formatted full name."""
Using positional arguments full_name = first + ' ' + last
return full_name.title()
def describe_pet(animal, name):
Functions are named blocks of code designed to do """Display information about a pet.""" musician = get_full_name('jimi', 'hendrix')
print("\nI have a " + animal + ".") print(musician)
one specific job. Functions allow you to write code
print("Its name is " + name + ".")
once that can then be run whenever you need to Returning a dictionary
accomplish the same task. Functions can take in the describe_pet('hamster', 'harry') def build_person(first, last):
information they need, and return the information they describe_pet('dog', 'willie') """Return a dictionary of information
generate. Using functions effectively makes your about a person.
programs easier to write, read, test, and fix. Using keyword arguments
"""
def describe_pet(animal, name): person = {'first': first, 'last': last}
"""Display information about a pet.""" return person
The first line of a function is its definition, marked by the print("\nI have a " + animal + ".")
keyword def. The name of the function is followed by a set print("Its name is " + name + ".") musician = build_person('jimi', 'hendrix')
of parentheses and a colon. A docstring, in triple quotes, print(musician)
describes what the function does. The body of a function is describe_pet(animal='hamster', name='harry')
describe_pet(name='willie', animal='dog') Returning a dictionary with optional values
indented one level.
To call a function, give the name of the function followed def build_person(first, last, age=None):
by a set of parentheses. """Return a dictionary of information
about a person.
Making a function You can provide a default value for a parameter. When
"""
function calls omit this argument the default value will be
def greet_user(): used. Parameters with default values must be listed after person = {'first': first, 'last': last}
"""Display a simple greeting.""" parameters without default values in the function's definition if age:
print("Hello!") so positional arguments can still work correctly. person['age'] = age
return person
greet_user() Using a default value
musician = build_person('jimi', 'hendrix', 27)
def describe_pet(name, animal='dog'):
print(musician)
"""Display information about a pet."""
Information that's passed to a function is called an print("\nI have a " + animal + ".")
musician = build_person('janis', 'joplin')
argument; information that's received by a function is called print("Its name is " + name + ".")
print(musician)
a parameter. Arguments are included in parentheses after
the function's name, and parameters are listed in describe_pet('harry', 'hamster')
parentheses in the function's definition. describe_pet('willie')
Try running some of these examples on pythontutor.com.
Passing a single argument Using None to make an argument optional
def greet_user(username): def describe_pet(animal, name=None):
"""Display a simple greeting.""" """Display information about a pet."""
print("Hello, " + username + "!") print("\nI have a " + animal + ".")
if name:
print("Its name is " + name + ".")
Covers Python 3 and Python 2
greet_user('jesse')
greet_user('diana')
greet_user('brandon') describe_pet('hamster', 'harry')
describe_pet('snake')
46
You can pass a list as an argument to a function, and the Sometimes you won't know how many arguments a You can store your functions in a separate file called a
function can work with the values in the list. Any changes function will need to accept. Python allows you to collect an module, and then import the functions you need into the file
the function makes to the list will affect the original list. You arbitrary number of arguments into one parameter using the containing your main program. This allows for cleaner
can prevent a function from modifying a list by passing a * operator. A parameter that accepts an arbitrary number of program files. (Make sure your module is stored in the
copy of the list as an argument. arguments must come last in the function definition. same directory as your main program.)
The ** operator allows a parameter to collect an arbitrary
Passing a list as an argument number of keyword arguments. Storing a function in a module
File: pizza.py
def greet_users(names): Collecting an arbitrary number of arguments
"""Print a simple greeting to everyone.""" def make_pizza(size, *toppings):
for name in names: def make_pizza(size, *toppings): """Make a pizza."""
msg = "Hello, " + name + "!" """Make a pizza.""" print("\nMaking a " + size + " pizza.")
print(msg) print("\nMaking a " + size + " pizza.") print("Toppings:")
print("Toppings:") for topping in toppings:
usernames = ['hannah', 'ty', 'margot'] for topping in toppings: print("- " + topping)
greet_users(usernames) print("- " + topping)
Importing an entire module
Allowing a function to modify a list File: making_pizzas.py
# Make three pizzas with different toppings. Every function in the module is available in the program file.
The following example sends a list of models to a function for
make_pizza('small', 'pepperoni')
printing. The original list is emptied, and the second list is filled. import pizza
make_pizza('large', 'bacon bits', 'pineapple')
def print_models(unprinted, printed): make_pizza('medium', 'mushrooms', 'peppers',
"""3d print a set of models.""" 'onions', 'extra cheese') pizza.make_pizza('medium', 'pepperoni')
while unprinted: pizza.make_pizza('small', 'bacon', 'pineapple')
current_model = unprinted.pop() Collecting an arbitrary number of keyword arguments
Importing a specific function
print("Printing " + current_model) def build_profile(first, last, **user_info): Only the imported functions are available in the program file.
printed.append(current_model) """Build a user's profile dictionary."""
# Build a dict with the required keys. from pizza import make_pizza
# Store some unprinted designs, profile = {'first': first, 'last': last}
# and print each of them. make_pizza('medium', 'pepperoni')
unprinted = ['phone case', 'pendant', 'ring'] # Add any other keys and values. make_pizza('small', 'bacon', 'pineapple')
printed = [] for key, value in user_info.items():
print_models(unprinted, printed) Giving a module an alias
profile[key] = value
import pizza as p
print("\nUnprinted:", unprinted) return profile
print("Printed:", printed) p.make_pizza('medium', 'pepperoni')
# Create two users with different kinds p.make_pizza('small', 'bacon', 'pineapple')
Preventing a function from modifying a list
The following example is the same as the previous one, except the # of information.
user_0 = build_profile('albert', 'einstein',
Giving a function an alias
original list is unchanged after calling print_models().
location='princeton') from pizza import make_pizza as mp
def print_models(unprinted, printed): user_1 = build_profile('marie', 'curie',
"""3d print a set of models.""" location='paris', field='chemistry') mp('medium', 'pepperoni')
while unprinted: mp('small', 'bacon', 'pineapple')
current_model = unprinted.pop() print(user_0)
print("Printing " + current_model) print(user_1) Importing all functions from a module
printed.append(current_model) Don't do this, but recognize it when you see it in others' code. It
can result in naming conflicts, which can cause errors.
# Store some unprinted designs, from pizza import *
# and print each of them. As you can see there are many ways to write and call a
original = ['phone case', 'pendant', 'ring'] function. When you're starting out, aim for something that make_pizza('medium', 'pepperoni')
printed = [] simply works. As you gain experience you'll develop an make_pizza('small', 'bacon', 'pineapple')
understanding of the more subtle advantages of different
print_models(original[:], printed) structures such as positional and keyword arguments, and
print("\nOriginal:", original) the various approaches to importing functions. For now if
print("Printed:", printed) your functions do what you need them to, you're doing well.
47
If the class you're writing is a specialized version of another
Creating an object from a class class, you can use inheritance. When one class inherits
my_car = Car('audi', 'a4', 2016) from another, it automatically takes on all the attributes and
methods of the parent class. The child class is free to
Accessing attribute values introduce new attributes and methods, and override
attributes and methods of the parent class.
print(my_car.make)
To inherit from another class include the name of the
print(my_car.model)
parent class in parentheses when defining the new class.
print(my_car.year)
Classes are the foundation of object-oriented The __init__() method for a child class
Calling methods
programming. Classes represent real-world things
class ElectricCar(Car):
you want to model in your programs: for example my_car.fill_tank()
"""A simple model of an electric car."""
dogs, cars, and robots. You use a class to make my_car.drive()
objects, which are specific instances of dogs, cars, Creating multiple objects def __init__(self, make, model, year):
and robots. A class defines the general behavior that """Initialize an electric car."""
a whole category of objects can have, and the my_car = Car('audi', 'a4', 2016) super().__init__(make, model, year)
information that can be associated with those objects. my_old_car = Car('subaru', 'outback', 2013)
my_truck = Car('toyota', 'tacoma', 2010) # Attributes specific to electric cars.
Classes can inherit from each other – you can
write a class that extends the functionality of an # Battery capacity in kWh.
existing class. This allows you to code efficiently for a self.battery_size = 70
# Charge level in %.
wide variety of situations. You can modify an attribute's value directly, or you can
self.charge_level = 0
write methods that manage updating values more carefully.
Modifying an attribute directly Adding new methods to the child class
Consider how we might model a car. What information class ElectricCar(Car):
my_new_car = Car('audi', 'a4', 2016)
would we associate with a car, and what behavior would it --snip--
my_new_car.fuel_level = 5
have? The information is stored in variables called def charge(self):
attributes, and the behavior is represented by functions. Writing a method to update an attribute's value """Fully charge the vehicle."""
Functions that are part of a class are called methods. self.charge_level = 100
def update_fuel_level(self, new_level): print("The vehicle is fully charged.")
The Car class """Update the fuel level."""
if new_level <= self.fuel_capacity: Using child methods and parent methods
class Car():
self.fuel_level = new_level
"""A simple attempt to model a car.""" my_ecar = ElectricCar('tesla', 'model s', 2016)
else:
print("The tank can't hold that much!")
def __init__(self, make, model, year): my_ecar.charge()
"""Initialize car attributes.""" Writing a method to increment an attribute's value my_ecar.drive()
self.make = make
self.model = model def add_fuel(self, amount):
self.year = year """Add fuel to the tank."""
if (self.fuel_level + amount
# Fuel capacity and level in gallons. <= self.fuel_capacity): There are many ways to model real world objects and
self.fuel_capacity = 15 self.fuel_level += amount situations in code, and sometimes that variety can feel
self.fuel_level = 0 print("Added fuel.") overwhelming. Pick an approach and try it – if your first
else: attempt doesn't work, try a different approach.
def fill_tank(self): print("The tank won't hold that much.")
"""Fill gas tank to capacity."""
self.fuel_level = self.fuel_capacity
print("Fuel tank is full.")
In Python class names are written in CamelCase and object Covers Python 3 and Python 2
def drive(self): names are written in lowercase with underscores. Modules
"""Simulate driving.""" that contain classes should still be named in lowercase with
print("The car is moving.") underscores.
48
Class files can get long as you add detailed information and
Overriding parent methods functionality. To help keep your program files uncluttered, Classes should inherit from object
class ElectricCar(Car): you can store your classes in modules and import the class ClassName(object):
--snip-- classes you need into your main program.
def fill_tank(self): The Car class in Python 2.7
Storing classes in a file
"""Display an error message.""" car.py class Car(object):
print("This car has no fuel tank!")
"""Represent gas and electric cars.""" Child class __init__() method is different
class ChildClassName(ParentClass):
class Car():
def __init__(self):
A class can have objects as attributes. This allows classes """A simple attempt to model a car."""
super(ClassName, self).__init__()
to work together to model complex situations. --snip--
The ElectricCar class in Python 2.7
A Battery class class Battery():
"""A battery for an electric car.""" class ElectricCar(Car):
class Battery(): def __init__(self, make, model, year):
"""A battery for an electric car.""" --snip--
super(ElectricCar, self).__init__(
class ElectricCar(Car): make, model, year)
def __init__(self, size=70):
"""Initialize battery attributes.""" """A simple model of an electric car."""
# Capacity in kWh, charge level in %. --snip--
self.size = size Importing individual classes from a module A list can hold as many items as you want, so you can
self.charge_level = 0 my_cars.py make a large number of objects from a class and store
them in a list.
def get_range(self): from car import Car, ElectricCar Here's an example showing how to make a fleet of rental
"""Return the battery's range.""" cars, and make sure all the cars are ready to drive.
if self.size == 70: my_beetle = Car('volkswagen', 'beetle', 2016)
my_beetle.fill_tank() A fleet of rental cars
return 240
elif self.size == 85: my_beetle.drive() from car import Car, ElectricCar
return 270
my_tesla = ElectricCar('tesla', 'model s', # Make lists to hold a fleet of cars.
Using an instance as an attribute 2016) gas_fleet = []
my_tesla.charge() electric_fleet = []
class ElectricCar(Car):
my_tesla.drive()
--snip--
Importing an entire module # Make 500 gas cars and 250 electric cars.
def __init__(self, make, model, year): for _ in range(500):
"""Initialize an electric car.""" import car car = Car('ford', 'focus', 2016)
super().__init__(make, model, year) gas_fleet.append(car)
my_beetle = car.Car( for _ in range(250):
# Attribute specific to electric cars. 'volkswagen', 'beetle', 2016) ecar = ElectricCar('nissan', 'leaf', 2016)
self.battery = Battery() my_beetle.fill_tank() electric_fleet.append(ecar)
my_beetle.drive()
def charge(self): # Fill the gas cars, and charge electric cars.
"""Fully charge the vehicle.""" my_tesla = car.ElectricCar( for car in gas_fleet:
self.battery.charge_level = 100 'tesla', 'model s', 2016) car.fill_tank()
print("The vehicle is fully charged.") my_tesla.charge() for ecar in electric_fleet:
my_tesla.drive() ecar.charge()
Using the instance
Importing all classes from a module
my_ecar = ElectricCar('tesla', 'model x', 2016) (Don’t do this, but recognize it when you see it.) print("Gas cars:", len(gas_fleet))
print("Electric cars:", len(electric_fleet))
from car import *
my_ecar.charge()
print(my_ecar.battery.get_range())
my_beetle = Car('volkswagen', 'beetle', 2016)
my_ecar.drive()
49
Storing the lines in a list Opening a file using an absolute path
filename = 'siddhartha.txt' f_path = "/home/ehmatthes/books/alice.txt"

with open(filename) as f_obj: with open(f_path) as f_obj:


lines = f_obj.readlines() lines = f_obj.readlines()

for line in lines: Opening a file on Windows


Windows will sometimes interpret forward slashes incorrectly. If
print(line.rstrip())
you run into this, use backslashes in your file paths.
f_path = r"C:\Users\ehmatthes\books\alice.txt"
Your programs can read information in from files, and Passing the 'w' argument to open() tells Python you want to
they can write data to files. Reading from files allows with open(f_path) as f_obj:
write to the file. Be careful; this will erase the contents of lines = f_obj.readlines()
you to work with a wide variety of information; writing the file if it already exists. Passing the 'a' argument tells
to files allows users to pick up where they left off the Python you want to append to the end of an existing file.
next time they run your program. You can write text to
files, and you can store Python structures such as Writing to an empty file When you think an error may occur, you can write a try-
lists in data files. filename = 'programming.txt' except block to handle the exception that might be raised.
The try block tells Python to try running some code, and the
with open(filename, 'w') as f: except block tells Python what to do if the code results in a
Exceptions are special objects that help your
f.write("I love programming!") particular kind of error.
programs respond to errors in appropriate ways. For
example if your program tries to open a file that Writing multiple lines to an empty file Handling the ZeroDivisionError exception
doesn’t exist, you can use exceptions to display an try:
informative error message instead of having the filename = 'programming.txt'
print(5/0)
program crash. except ZeroDivisionError:
with open(filename, 'w') as f:
f.write("I love programming!\n") print("You can't divide by zero!")
f.write("I love creating new games.\n") Handling the FileNotFoundError exception
To read from a file your program needs to open the file and
Appending to a file f_name = 'siddhartha.txt'
then read the contents of the file. You can read the entire
contents of the file at once, or read the file line by line. The filename = 'programming.txt'
with statement makes sure the file is closed properly when try:
the program has finished accessing the file. with open(filename, 'a') as f: with open(f_name) as f_obj:
f.write("I also love working with data.\n") lines = f_obj.readlines()
Reading an entire file at once f.write("I love making apps as well.\n") except FileNotFoundError:
filename = 'siddhartha.txt' msg = "Can't find file {0}.".format(f_name)
print(msg)
with open(filename) as f_obj:
contents = f_obj.read() When Python runs the open() function, it looks for the file in
the same directory where the program that's being executed
is stored. You can open a file from a subfolder using a It can be hard to know what kind of exception to handle
print(contents) when writing code. Try writing your code without a try block,
relative path. You can also use an absolute path to open
Reading line by line any file on your system. and make it generate an error. The traceback will tell you
Each line that's read from the file has a newline character at the what kind of exception your program needs to handle.
end of the line, and the print function adds its own newline Opening a file from a subfolder
character. The rstrip() method gets rid of the extra blank lines
this would result in when printing to the terminal.
f_path = "text_files/alice.txt"

filename = 'siddhartha.txt' with open(f_path) as f_obj:


lines = f_obj.readlines()
Covers Python 3 and Python 2
with open(filename) as f_obj:
for line in f_obj: for line in lines:
print(line.rstrip()) print(line.rstrip())
50
Using an else block
The try block should only contain code that may cause an error. Any code that depends on the try block running successfully should be placed in the else block.

print("Enter two numbers. I'll divide them.")

x = input("First number: ")
y = input("Second number: ")

try:
    result = int(x) / int(y)
except ZeroDivisionError:
    print("You can't divide by zero!")
else:
    print(result)

Preventing crashes from user input
Without the except block in the following example, the program would crash if the user tries to divide by zero. As written, it will handle the error gracefully and keep running.

"""A simple calculator for division only."""

print("Enter two numbers. I'll divide them.")
print("Enter 'q' to quit.")

while True:
    x = input("\nFirst number: ")
    if x == 'q':
        break
    y = input("Second number: ")
    if y == 'q':
        break

    try:
        result = int(x) / int(y)
    except ZeroDivisionError:
        print("You can't divide by zero!")
    else:
        print(result)

Well-written, properly tested code is not very prone to internal errors such as syntax or logical errors. But every time your program depends on something external such as user input or the existence of a file, there's a possibility of an exception being raised. It's up to you how to communicate errors to your users. Sometimes users need to know if a file is missing; sometimes it's better to handle the error silently. A little experience will help you know how much to report.

Using the pass statement in an except block
Sometimes you want your program to just continue running when it encounters an error, without reporting the error to the user. Using the pass statement in an except block allows you to do this.

f_names = ['alice.txt', 'siddhartha.txt',
    'moby_dick.txt', 'little_women.txt']

for f_name in f_names:
    # Report the length of each file found.
    try:
        with open(f_name) as f_obj:
            lines = f_obj.readlines()
    except FileNotFoundError:
        # Just move on to the next file.
        pass
    else:
        num_lines = len(lines)
        msg = "{0} has {1} lines.".format(
            f_name, num_lines)
        print(msg)

Don't use bare except blocks
Exception-handling code should catch specific exceptions that you expect to happen during your program's execution. A bare except block will catch all exceptions, including keyboard interrupts and system exits you might need when forcing a program to close.

try:
    # Do something
except:
    pass

Use Exception instead
If you want to use a try block and you're not sure which exception to catch, use Exception. It will catch most exceptions, but still allow you to interrupt programs intentionally.

try:
    # Do something
except Exception:
    pass

Printing the exception

try:
    # Do something
except Exception as e:
    print(e, type(e))

Storing data with json
The json module allows you to dump simple Python data structures into a file, and load the data from that file the next time the program runs. The JSON data format is not specific to Python, so you can share this kind of data with people who work in other languages as well. Knowing how to manage exceptions is important when working with stored data. You'll usually want to make sure the data you're trying to load exists before working with it.

Using json.dump() to store data

"""Store some numbers."""

import json

numbers = [2, 3, 5, 7, 11, 13]

filename = 'numbers.json'
with open(filename, 'w') as f_obj:
    json.dump(numbers, f_obj)

Using json.load() to read data

"""Load some previously stored numbers."""

import json

filename = 'numbers.json'
with open(filename) as f_obj:
    numbers = json.load(f_obj)

print(numbers)

Making sure the stored data exists

import json

f_name = 'numbers.json'

try:
    with open(f_name) as f_obj:
        numbers = json.load(f_obj)
except FileNotFoundError:
    msg = "Can't find {0}.".format(f_name)
    print(msg)
else:
    print(numbers)

Practice with exceptions
Take a program you've already written that prompts for user input, and add some error-handling code to the program.
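Handling invalid input (sketch)
As a minimal sketch of the practice exercise above, the int() conversion is wrapped in a try block so that non-numeric input raises ValueError instead of crashing the program. The prompt text is made up for the example.

prompt = "How old are you? "
age = input(prompt)

try:
    age = int(age)
except ValueError:
    print("Please enter a number.")
else:
    print("Next year you'll be " + str(age + 1) + ".")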
51
Building a testcase with one unit test Running the test
To build a test case, make a class that inherits from When you change your code, it’s important to run your existing
unittest.TestCase and write methods that begin with test_. tests. This will tell you whether the changes you made affected
Save this as test_full_names.py existing behavior.

import unittest E
from full_names import get_full_name ================================================
ERROR: test_first_last (__main__.NamesTestCase)
class NamesTestCase(unittest.TestCase): Test names like Janis Joplin.
"""Tests for names.py.""" ------------------------------------------------
Traceback (most recent call last):
def test_first_last(self): File "test_full_names.py", line 10,
When you write a function or a class, you can also """Test names like Janis Joplin.""" in test_first_last
write tests for that code. Testing proves that your full_name = get_full_name('janis', 'joplin')
code works as it's supposed to in the situations it's 'joplin') TypeError: get_full_name() missing 1 required
designed to handle, and also when people use your self.assertEqual(full_name, positional argument: 'last'
programs in unexpected ways. Writing tests gives 'Janis Joplin')
you confidence that your code will work correctly as ------------------------------------------------
more people begin to use your programs. You can unittest.main() Ran 1 test in 0.001s
also add new features to your programs and know Running the test
that you haven't broken existing behavior. FAILED (errors=1)
Python reports on each unit test in the test case. The dot reports a
single passing test. Python informs us that it ran 1 test in less than Fixing the code
A unit test verifies that one specific aspect of your 0.001 seconds, and the OK lets us know that all unit tests in the When a test fails, the code needs to be modified until the test
test case passed. passes again. (Don’t make the mistake of rewriting your tests to fit
code works as it's supposed to. A test case is a
your new code.) Here we can make the middle name optional.
collection of unit tests which verify your code's .
behavior in a wide variety of situations. --------------------------------------- def get_full_name(first, last, middle=''):
Ran 1 test in 0.000s """Return a full name."""
if middle:
OK full_name = "{0} {1} {2}".format(first,
Python's unittest module provides tools for testing your middle, last)
code. To try it out, we’ll create a function that returns a full else:
name. We’ll use the function in a regular program, and then full_name = "{0} {1}".format(first,
build a test case for the function. Failing tests are important; they tell you that a change in the
last)
code has affected existing behavior. When a test fails, you
A function to test return full_name.title()
need to modify the code so the existing behavior still works.
Save this as full_names.py
Modifying the function Running the test
def get_full_name(first, last): Now the test should pass again, which means our original
We’ll modify get_full_name() so it handles middle names, but
"""Return a full name.""" functionality is still intact.
we’ll do it in a way that breaks existing behavior.
full_name = "{0} {1}".format(first, last) .
return full_name.title() def get_full_name(first, middle, last):
---------------------------------------
"""Return a full name."""
Ran 1 test in 0.000s
Using the function full_name = "{0} {1} {2}".format(first,
Save this as names.py middle, last)
OK
return full_name.title()
from full_names import get_full_name
Using the function
janis = get_full_name('janis', 'joplin')
print(janis) from full_names import get_full_name

bob = get_full_name('bob', 'dylan') john = get_full_name('john', 'lee', 'hooker')
print(john)
print(bob)

david = get_full_name('david', 'lee', 'roth')
print(david)
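Keeping the function and test in one file (sketch)
For quick experiments you can keep the function and its test case in a single file and run it directly with python. This is a sketch based on the get_full_name() example above; in a real project the function and tests usually live in separate files as shown.

import unittest

def get_full_name(first, last):
    """Return a full name."""
    full_name = "{0} {1}".format(first, last)
    return full_name.title()

class NamesTestCase(unittest.TestCase):
    """Tests for get_full_name()."""

    def test_first_last(self):
        """Test names like Janis Joplin."""
        full_name = get_full_name('janis', 'joplin')
        self.assertEqual(full_name, 'Janis Joplin')

unittest.main()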
52
You can add as many unit tests to a test case as you need. Testing a class is similar to testing a function, since you’ll When testing a class, you usually have to make an instance
To write a new test, add a new method to your test case mostly be testing your methods. of the class. The setUp() method is run before every test.
class. Any instances you make in setUp() are available in every
A class to test test you write.
Testing middle names Save as accountant.py
We’ve shown that get_full_name() works for first and last Using setUp() to support multiple tests
names. Let’s test that it works for middle names as well.
class Accountant():
The instance self.acc can be used in each new test.
"""Manage a bank account."""
import unittest import unittest
from full_names import get_full_name def __init__(self, balance=0): from accountant import Accountant
self.balance = balance
class NamesTestCase(unittest.TestCase): class TestAccountant(unittest.TestCase):
"""Tests for names.py.""" def deposit(self, amount): """Tests for the class Accountant."""
self.balance += amount
def test_first_last(self): def setUp(self):
"""Test names like Janis Joplin.""" def withdraw(self, amount): self.acc = Accountant()
full_name = get_full_name('janis', self.balance -= amount
'joplin') def test_initial_balance(self):
self.assertEqual(full_name, Building a testcase # Default balance should be 0.
'Janis Joplin') For the first test, we’ll make sure we can start out with different
initial balances. Save this as test_accountant.py. self.assertEqual(self.acc.balance, 0)

def test_middle(self): import unittest # Test non-default balance.
"""Test names like David Lee Roth.""" from accountant import Accountant acc = Accountant(100)
full_name = get_full_name('david', self.assertEqual(acc.balance, 100)
'roth', 'lee') class TestAccountant(unittest.TestCase):
self.assertEqual(full_name, """Tests for the class Accountant.""" def test_deposit(self):
'David Lee Roth') # Test single deposit.
def test_initial_balance(self): self.acc.deposit(100)
unittest.main() # Default balance should be 0. self.assertEqual(self.acc.balance, 100)
acc = Accountant()
Running the tests self.assertEqual(acc.balance, 0)
The two dots represent two passing tests. # Test multiple deposits.
self.acc.deposit(100)
.. # Test non-default balance. self.acc.deposit(100)
--------------------------------------- acc = Accountant(100) self.assertEqual(self.acc.balance, 300)
Ran 2 tests in 0.000s self.assertEqual(acc.balance, 100)
def test_withdrawal(self):
OK unittest.main() # Test single withdrawal.
self.acc.deposit(1000)
Running the test
self.acc.withdraw(100)
. self.assertEqual(self.acc.balance, 900)
Python provides a number of assert methods you can use
to test your code. ---------------------------------------
Ran 1 test in 0.000s unittest.main()
Verify that a==b, or a != b
OK Running the tests
assertEqual(a, b)
assertNotEqual(a, b) ...
---------------------------------------
Verify that x is True, or x is False In general you shouldn’t modify a test once it’s written. Ran 3 tests in 0.001s
assertTrue(x) When a test fails it usually means new code you’ve written
has broken existing functionality, and you need to modify OK
assertFalse(x)
the new code until all existing tests pass.
Verify an item is in a list, or not in a list If your original requirements have changed, it may be
appropriate to modify some tests. This usually happens in
assertIn(item, list) the early stages of a project when desired behavior is still
assertNotIn(item, list) being sorted out.
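Trying the assert methods (sketch)
A small sketch exercising several of these assert methods in one test case; the list and values are made up for the example.

import unittest

class AssertMethodsTestCase(unittest.TestCase):
    """Demonstrate a few of unittest's assert methods."""

    def test_colors(self):
        colors = ['red', 'green', 'blue']
        self.assertEqual(len(colors), 3)
        self.assertNotEqual(colors[0], 'green')
        self.assertTrue('red' in colors)
        self.assertFalse('purple' in colors)
        self.assertIn('blue', colors)
        self.assertNotIn('yellow', colors)

unittest.main()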
53
The following code sets up an empty game window, and
starts an event loop and a loop that continually refreshes Useful rect attributes
Once you have a rect object, there are a number of attributes that
the screen.
are useful when positioning objects and detecting relative positions
An empty game window of objects. (You can find more attributes in the Pygame
documentation.)
import sys
# Individual x and y values:
import pygame as pg
screen_rect.left, screen_rect.right
screen_rect.top, screen_rect.bottom
Pygame is a framework for making games using def run_game():
screen_rect.centerx, screen_rect.centery
Python. Making games is fun, and it’s a great way to # Initialize and set up screen.
screen_rect.width, screen_rect.height
expand your programming skills and knowledge. pg.init()
Pygame takes care of many of the lower-level tasks screen = pg.display.set_mode((1200, 800))
# Tuples
in building games, which lets you focus on the pg.display.set_caption("Alien Invasion")
screen_rect.center
aspects of your game that make it interesting. screen_rect.size
# Start main loop.
while True: Creating a rect object
# Start event loop. You can create a rect object from scratch. For example a small rect
Pygame runs on all systems, but setup is slightly different for event in pg.event.get(): object that’s filled in can represent a bullet in a game. The Rect()
if event.type == pg.QUIT: class takes the coordinates of the upper left corner, and the width
on each OS. The instructions here assume you’re using and height of the rect. The draw.rect() function takes a screen
Python 3, and provide a minimal installation of Pygame. If sys.exit()
object, a color, and a rect. This function fills the given rect with the
these instructions don’t work for your system, see the more given color.
detailed notes at http://ehmatthes.github.io/pcc/. # Refresh screen.
pg.display.flip() bullet_rect = pg.Rect(100, 100, 3, 15)
Pygame on Linux color = (100, 100, 100)
$ sudo apt-get install python3-dev mercurial run_game() pg.draw.rect(screen, color, bullet_rect)
libsdl-image1.2-dev libsdl2-dev Setting a custom window size
libsdl-ttf2.0-dev The display.set_mode() function accepts a tuple that defines the
$ pip install --user screen size. Many objects in a game are images that are moved around
hg+http://bitbucket.org/pygame/pygame the screen. It’s easiest to use bitmap (.bmp) image files, but
screen_dim = (1200, 800)
screen = pg.display.set_mode(screen_dim) you can also configure your system to work with jpg, png,
Pygame on OS X and gif files as well.
This assumes you’ve used Homebrew to install Python 3.
Setting a custom background color
$ brew install hg sdl sdl_image sdl_ttf Colors are defined as a tuple of red, green, and blue values. Each Loading an image
$ pip install --user value ranges from 0-255. ship = pg.image.load('images/ship.bmp')
hg+http://bitbucket.org/pygame/pygame bg_color = (230, 230, 230)
Getting the rect object from an image
Pygame on Windows screen.fill(bg_color)
Find an installer at ship_rect = ship.get_rect()
https://bitbucket.org/pygame/pygame/downloads/ or
http://www.lfd.uci.edu/~gohlke/pythonlibs/#pygame that matches Positioning an image
your version of Python. Run the installer file if it’s a .exe or .msi file. Many objects in a game can be treated as simple With rects, it’s easy to position an image wherever you want on the
If it’s a .whl file, use pip to install Pygame: rectangles, rather than their actual shape. This simplifies screen, or in relation to another object. The following code
positions a ship object at the bottom center of the screen.
> python -m pip install --user
pygame-1.9.2a0-cp35-none-win32.whl rect object that makes it easy to work with game objects. ship_rect.midbottom = screen_rect.midbottom

Testing your installation Getting the screen rect object
We already have a screen object; we can easily access the rect
To test your installation, open a terminal session and try to import
object associated with the screen.
Pygame. If you don’t get any error messages, your installation was
successful. screen_rect = screen.get_rect()
$ python
Finding the center of the screen
>>> import pygame Rect objects have a center attribute which stores the center point.
>>>
screen_center = screen_rect.center
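Drawing a rect each frame (sketch)
A sketch that combines the pieces above: an event loop that fills the background color and draws a small rect at the center of the screen on every pass. The colors and sizes are arbitrary, and Pygame must already be installed.

import sys
import pygame as pg

pg.init()
screen = pg.display.set_mode((1200, 800))
screen_rect = screen.get_rect()

bg_color = (230, 230, 230)
bullet_color = (100, 100, 100)
bullet_rect = pg.Rect(0, 0, 3, 15)
bullet_rect.center = screen_rect.center

while True:
    for event in pg.event.get():
        if event.type == pg.QUIT:
            sys.exit()

    screen.fill(bg_color)
    pg.draw.rect(screen, bullet_color, bullet_rect)
    pg.display.flip()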
54
Pygame’s event loop registers an event any time the
Drawing an image to the screen mouse moves, or a mouse button is pressed or released. Removing an item from a group
Once an image is loaded and positioned, you can draw it to the It’s important to delete elements that will never appear again in the
screen with the blit() method. The blit() method acts on the screen Responding to the mouse button game, so you don’t waste memory and resources.
object, and takes the image object and image rect as arguments.
for event in pg.event.get(): bullets.remove(bullet)
# Draw ship to screen. if event.type == pg.MOUSEBUTTONDOWN:
screen.blit(ship, ship_rect) ship.fire_bullet()
The blitme() method Finding the mouse position You can detect when a single object collides with any
Game objects such as ships are often written as classes. Then a The mouse position is returned as a tuple. member of a group. You can also detect when any member
blitme() method is usually defined, which draws the object to the of one group collides with a member of another group.
screen. mouse_pos = pg.mouse.get_pos()
def blitme(self):
Collisions between a single object and a group
Clicking a button The spritecollideany() function takes an object and a group, and
"""Draw ship at current location.""" You might want to know if the cursor is over an object such as a returns True if the object overlaps with any member of the group.
self.screen.blit(self.image, self.rect) button. The rect.collidepoint() method returns true when a point is
inside a rect object. if pg.sprite.spritecollideany(ship, aliens):
ships_left -= 1
if button_rect.collidepoint(mouse_pos):
Pygame watches for events such as key presses and start_game() Collisions between two groups
mouse actions. You can detect any event you care about in The sprite.groupcollide() function takes two groups, and two
Hiding the mouse booleans. The function returns a dictionary containing information
the event loop, and respond with any action that’s
about the members that have collided. The booleans tell Pygame
appropriate for your game. pg.mouse.set_visible(False) whether to delete the members of either group that have collided.
Responding to key presses collisions = pg.sprite.groupcollide(
Pygame’s main event loop registers a KEYDOWN event any time a bullets, aliens, True, True)
key is pressed. When this happens, you can check for specific Pygame has a Group class which makes working with a
keys.
group of similar objects easier. A group is like a list, with score += len(collisions) * alien_point_value
for event in pg.event.get(): some extra functionality that’s helpful when building games.
if event.type == pg.KEYDOWN:
Making and filling a group
if event.key == pg.K_RIGHT: An object that will be placed in a group must inherit from Sprite.
ship_rect.x += 1 You can use text for a variety of purposes in a game. For
elif event.key == pg.K_LEFT: from pygame.sprite import Sprite, Group example you can share information with players, and you
ship_rect.x -= 1 can display a score.
elif event.key == pg.K_SPACE: class Bullet(Sprite): Displaying a message
ship.fire_bullet() ... The following code defines a message, then a color for the text and
elif event.key == pg.K_q: def draw_bullet(self): the background color for the message. A font is defined using the
sys.exit() ... default system font, with a font size of 48. The font.render()
def update(self): function is used to create an image of the message, and we get the
Responding to released keys ... rect object associated with the image. We then center the image
When the user releases a key, a KEYUP event is triggered. on the screen and display it.

if event.type == pg.KEYUP: bullets = Group() msg = "Play again?"
if event.key == pg.K_RIGHT: msg_color = (100, 100, 100)
ship.moving_right = False new_bullet = Bullet() bg_color = (230, 230, 230)
bullets.add(new_bullet)
f = pg.font.SysFont(None, 48)
Looping through the items in a group msg_image = f.render(msg, True, msg_color,
The Pygame documentation is really helpful when building The sprites() method returns all the members of a group.
bg_color)
your own games. The home page for the Pygame project is for bullet in bullets.sprites(): msg_image_rect = msg_image.get_rect()
at http://pygame.org/, and the home page for the bullet.draw_bullet() msg_image_rect.center = screen_rect.center
documentation is at http://pygame.org/docs/. screen.blit(msg_image, msg_image_rect)
The most useful parts of the documentation are the pages about specific parts of Pygame, such as the Rect() class and the sprite module. You can find a list of these elements at the top of the help pages.

Calling update() on a group
Calling update() on a group automatically calls update() on each member of the group.

bullets.update()
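A minimal sprite class (sketch)
A sketch of what a minimal Bullet sprite might look like, with an update() method that moves it up the screen; the attribute names, sizes, and speed are made up for the example. In the game loop you would call bullets.update() and then discard any bullet whose rect has left the screen.

import pygame as pg
from pygame.sprite import Sprite

class Bullet(Sprite):
    """A small rect that moves up the screen."""

    def __init__(self, screen, centerx, bottom):
        super().__init__()
        self.screen = screen
        self.color = (100, 100, 100)
        self.rect = pg.Rect(0, 0, 3, 15)
        self.rect.centerx = centerx
        self.rect.bottom = bottom
        self.speed = 1

    def update(self):
        """Move the bullet up the screen."""
        self.rect.y -= self.speed

    def draw_bullet(self):
        """Draw the bullet at its current position."""
        pg.draw.rect(self.screen, self.color, self.rect)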
55
Making a scatter plot Emphasizing points
The scatter() function takes a list of x values and a list of y values, You can plot as much data as you want on one plot. Here we re-
and a variety of optional arguments. The s=10 argument controls plot the first and last points larger to emphasize them.
the size of each point.
import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
x_values = list(range(1000))
x_values = list(range(1000)) squares = [x**2 for x in x_values]
squares = [x**2 for x in x_values] plt.scatter(x_values, squares, c=squares,
cmap=plt.cm.Blues, edgecolor='none',
plt.scatter(x_values, squares, s=10) s=10)
plt.show()
Data visualization involves exploring data through plt.scatter(x_values[0], squares[0], c='green',
visual representations. The matplotlib package helps edgecolor='none', s=100)
you make visually appealing representations of the plt.scatter(x_values[-1], squares[-1], c='red',
data you’re working with. matplotlib is extremely Plots can be customized in a wide variety of ways. Just
edgecolor='none', s=100)
flexible; these examples will help you get started with about any element of a plot can be customized.
a few simple visualizations. Adding titles and labels, and scaling axes plt.title("Square Numbers", fontsize=24)
--snip--
import matplotlib.pyplot as plt
Removing axes
matplotlib runs on all systems, but setup is slightly different x_values = list(range(1000)) You can customize or remove axes entirely. Here’s how to access
depending on your OS. If the minimal instructions here squares = [x**2 for x in x_values] each axis, and hide it.
don’t work for you, see the more detailed instructions at plt.scatter(x_values, squares, s=10) plt.axes().get_xaxis().set_visible(False)
http://ehmatthes.github.io/pcc/. You should also consider plt.axes().get_yaxis().set_visible(False)
installing the Anaconda distribution of Python from
https://continuum.io/downloads/, which includes matplotlib. plt.xlabel("Value", fontsize=18) Setting a custom figure size
plt.ylabel("Square of Value", fontsize=18) You can make your plot as big or small as you want. Before
matplotlib on Linux plt.tick_params(axis='both', which='major', plotting your data, add the following code. The dpi argument is
optional; if you don’t know your system’s resolution you can omit
$ sudo apt-get install python3-matplotlib labelsize=14)
the argument and adjust the figsize argument accordingly.
plt.axis([0, 1100, 0, 1100000])
matplotlib on OS X plt.figure(dpi=128, figsize=(10, 6))
Start a terminal session and enter import matplotlib to see if plt.show()
it’s already installed on your system. If not, try this command: Saving a plot
Using a colormap The matplotlib viewer has an interactive save button, but you can
$ pip install --user matplotlib also save your visualizations programmatically. To do so, replace
A colormap varies the point colors from one shade to another,
based on a certain value for each point. The value used to plt.show() with plt.savefig(). The bbox_inches='tight'
matplotlib on Windows argument trims extra whitespace from the plot.
You first need to install Visual Studio, which you can do from determine the color of each point is passed to the c argument, and
https://dev.windows.com/. The Community edition is free. Then go the cmap argument specifies which colormap to use.
plt.savefig('squares.png', bbox_inches='tight')
to https://pypi.python.org/pypi/matplotlib/ or The edgecolor='none' argument removes the black outline
http://www.lfd.uci.edu/~gohlke/pythonlibs/#matplotlib and download
an appropriate installer file.
plt.scatter(x_values, squares, c=squares,
cmap=plt.cm.Blues, edgecolor='none', The matplotlib gallery and documentation are at
s=10) http://matplotlib.org/. Be sure to visit the examples, gallery,
and pyplot links.
Making a line graph
import matplotlib.pyplot as plt

x_values = [0, 1, 2, 3, 4, 5]
squares = [0, 1, 4, 9, 16, 25]
plt.plot(x_values, squares)
plt.show()
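A fully labeled plot saved to a file (sketch)
A sketch that puts several of the customizations above together: a sized figure, a line graph with a title and axis labels, and programmatic saving instead of plt.show(). The file name and sizes are arbitrary.

import matplotlib.pyplot as plt

x_values = [0, 1, 2, 3, 4, 5]
squares = [0, 1, 4, 9, 16, 25]

plt.figure(dpi=128, figsize=(10, 6))
plt.plot(x_values, squares)

plt.title("Square Numbers", fontsize=24)
plt.xlabel("Value", fontsize=18)
plt.ylabel("Square of Value", fontsize=18)
plt.tick_params(axis='both', which='major', labelsize=14)

plt.savefig('squares_line.png', bbox_inches='tight')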
56
You can make as many plots as you want on one figure. You can include as many individual graphs in one figure as
When you make multiple plots, you can emphasize Datetime formatting arguments you want. This is useful, for example, when comparing
The strftime() function generates a formatted string from a
relationships in the data. For example you can fill the space related datasets.
datetime object, and the strptime() function generates a
between two sets of data. datetime object from a string. The following codes let you work with Sharing an x-axis
Plotting two sets of data dates exactly as you need to. The following code plots a set of squares and a set of cubes on
Here we use plt.scatter() twice to plot square numbers and %A Weekday name, such as Monday two separate graphs that share a common x-axis.
cubes on the same figure. The plt.subplots() function returns a figure object and a tuple
%B Month name, such as January of axes. Each set of axes corresponds to a separate plot in the
import matplotlib.pyplot as plt %m Month, as a number (01 to 12) figure. The first two arguments control the number of rows and
%d Day of the month, as a number (01 to 31) columns generated in the figure.
x_values = list(range(11)) %Y Four-digit year, such as 2016
%y Two-digit year, such as 16 import matplotlib.pyplot as plt
squares = [x**2 for x in x_values]
cubes = [x**3 for x in x_values] %H Hour, in 24-hour format (00 to 23)
%I Hour, in 12-hour format (01 to 12) x_vals = list(range(11))
%p AM or PM squares = [x**2 for x in x_vals]
plt.scatter(x_values, squares, c='blue',
%M Minutes (00 to 59) cubes = [x**3 for x in x_vals]
edgecolor='none', s=20)
plt.scatter(x_values, cubes, c='red', %S Seconds (00 to 61)
fig, axarr = plt.subplots(2, 1, sharex=True)
edgecolor='none', s=20)
Converting a string to a datetime object
axarr[0].scatter(x_vals, squares)
plt.axis([0, 11, 0, 1100]) new_years = dt.strptime('1/1/2017', '%m/%d/%Y')
axarr[0].set_title('Squares')
plt.show()
Converting a datetime object to a string
Filling the space between data sets axarr[1].scatter(x_vals, cubes, c='red')
The fill_between() method fills the space between two data ny_string = dt.strftime(new_years, '%B %d, %Y') axarr[1].set_title('Cubes')
sets. It takes a series of x-values and two series of y-values. It also print(ny_string)
takes a facecolor to use for the fill, and an optional alpha plt.show()
argument that controls the color’s transparency. Plotting high temperatures
The following code creates a list of dates and a corresponding list
Sharing a y-axis
plt.fill_between(x_values, cubes, squares, of high temperatures. It then plots the high temperatures, with the
To share a y-axis, we use the sharey=True argument.
facecolor='blue', alpha=0.25) date labels displayed in a specific format.
from datetime import datetime as dt import matplotlib.pyplot as plt

import matplotlib.pyplot as plt x_vals = list(range(11))
Many interesting data sets have a date or time as the x- squares = [x**2 for x in x_vals]
value. Python’s datetime module helps you work with this from matplotlib import dates as mdates
cubes = [x**3 for x in x_vals]
kind of data.
dates = [
fig, axarr = plt.subplots(1, 2, sharey=True)
Generating the current date dt(2016, 6, 21), dt(2016, 6, 22),
The datetime.now() function returns a datetime object dt(2016, 6, 23), dt(2016, 6, 24),
representing the current date and time. ] axarr[0].scatter(x_vals, squares)
axarr[0].set_title('Squares')
from datetime import datetime as dt
highs = [57, 68, 64, 59]
axarr[1].scatter(x_vals, cubes, c='red')
today = dt.now() axarr[1].set_title('Cubes')
date_string = dt.strftime(today, '%m/%d/%Y') fig = plt.figure(dpi=128, figsize=(10,6))
print(date_string) plt.plot(dates, highs, c='red')
plt.title("Daily High Temps", fontsize=24) plt.show()
Generating a specific date plt.ylabel("Temp (F)", fontsize=16)
You can also generate a datetime object for any date and time you
want. The positional order of arguments is year, month, and day. x_axis = plt.axes().get_xaxis()
The hour, minute, second, and microsecond arguments are
x_axis.set_major_formatter(
optional.
mdates.DateFormatter('%B %d %Y')
from datetime import datetime as dt )
fig.autofmt_xdate()
new_years = dt(2017, 1, 1)
fall_equinox = dt(year=2016, month=9, day=22) plt.show()
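A runnable fill_between() example (sketch)
The fill_between() call above needs two data sets to work with; this minimal sketch plots squares and cubes and shades the region between them. The colors and alpha value are arbitrary.

import matplotlib.pyplot as plt

x_values = list(range(11))
squares = [x**2 for x in x_values]
cubes = [x**3 for x in x_values]

plt.plot(x_values, squares, c='blue')
plt.plot(x_values, cubes, c='red')
plt.fill_between(x_values, cubes, squares,
                 facecolor='blue', alpha=0.25)

plt.show()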
57
You can add as much data as you want when making a
Making a scatter plot visualization.
The data for a scatter plot needs to be a list containing tuples of
the form (x, y). The stroke=False argument tells Pygal to make Plotting squares and cubes
an XY chart with no line connecting the points.
import pygal
import pygal
x_values = list(range(11))
squares = [ squares = [x**2 for x in x_values]
(0, 0), (1, 1), (2, 4), (3, 9), cubes = [x**3 for x in x_values]
Data visualization involves exploring data through (4, 16), (5, 25),
visual representations. Pygal helps you make visually ] chart = pygal.Line()
appealing representations of the data you’re working
chart.force_uri_protocol = 'http'
with. Pygal is particularly well suited for visualizations chart = pygal.XY(stroke=False) chart.title = "Squares and Cubes"
that will be presented online, because it supports chart.force_uri_protocol = 'http' chart.x_labels = x_values
interactive elements. chart.add('x^2', squares)
chart.render_to_file('squares.svg') chart.add('Squares', squares)
Using a list comprehension for a scatter plot chart.add('Cubes', cubes)
Pygal can be installed using pip. A list comprehension can be used to efficiently make a dataset for chart.render_to_file('squares_cubes.svg')
a scatter plot.
Pygal on Linux and OS X Filling the area under a data series
squares = [(x, x**2) for x in range(1000)] Pygal allows you to fill the area under or over each series of data.
$ pip install --user pygal The default is to fill from the x-axis up, but you can fill from any
Making a bar graph horizontal line using the zero argument.
Pygal on Windows A bar graph requires a list of values for the bar sizes. To label the
bars, pass a list of the same length to x_labels. chart = pygal.Line(fill=True, zero=0)
> python -m pip install --user pygal
import pygal

outcomes = [1, 2, 3, 4, 5, 6]
To make a plot with Pygal, you specify the kind of plot and frequencies = [18, 16, 18, 17, 18, 13]
then add the data.
chart = pygal.Bar()
Making a line graph
To view the output, open the file squares.svg in a browser.
chart.force_uri_protocol = 'http'
chart.x_labels = outcomes
import pygal chart.add('D6', frequencies)
chart.render_to_file('rolling_dice.svg')
x_values = [0, 1, 2, 3, 4, 5]
squares = [0, 1, 4, 9, 16, 25] Making a bar graph from a dictionary
Since each bar needs a label and a value, a dictionary is a great The documentation for Pygal is available at
way to store the data for a bar graph. The keys are used as the http://www.pygal.org/.
chart = pygal.Line() labels along the x-axis, and the values are used to determine the
chart.force_uri_protocol = 'http' height of each bar.
chart.add('x^2', squares)
chart.render_to_file('squares.svg') import pygal If you’re viewing svg output in a browser, Pygal needs to
render the output file in a specific way. The
Adding labels and a title results = { force_uri_protocol attribute for chart objects needs to
--snip-- 1:18, 2:16, 3:18, be set to 'http'.
chart = pygal.Line() 4:17, 5:18, 6:13,
chart.force_uri_protocol = 'http' }
chart.title = "Squares"
chart.x_labels = x_values chart = pygal.Bar()
chart.x_title = "Value" chart.force_uri_protocol = 'http'
chart.y_title = "Square of Value" chart.x_labels = results.keys()
chart.add('x^2', squares) chart.add('D6', results.values())
chart.render_to_file('squares.svg') chart.render_to_file('rolling_dice.svg')
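A complete scatter plot script (sketch)
Combining the list-comprehension data set with the XY chart shown above gives a complete script; the output file name is arbitrary.

import pygal

squares = [(x, x**2) for x in range(1000)]

chart = pygal.XY(stroke=False)
chart.force_uri_protocol = 'http'
chart.title = "Squares"
chart.add('x^2', squares)
chart.render_to_file('squares_scatter.svg')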
58
Pygal lets you customize many elements of a plot. There Pygal can generate world maps, and you can add any data
are some excellent default themes, and many options for Configuration settings you want to these maps. Data is indicated by coloring, by
Some settings are controlled by a Config object.
styling individual plot elements. labels, and by tooltips that show data when users hover
my_config = pygal.Config() over each country on the map.
Using built-in styles
To use built-in styles, import the style and make an instance of the
my_config.show_y_guides = False
Installing the world map module
style class. Then pass the style object with the style argument my_config.width = 1000 The world map module is not included by default in Pygal 2.0. It
when you make the chart object. my_config.dots_size = 5 can be installed with pip:
import pygal chart = pygal.Line(config=my_config) $ pip install --user pygal_maps_world
from pygal.style import LightGreenStyle --snip-- Making a world map
x_values = list(range(11)) Styling series The following code makes a simple world map showing the
You can give each series on a chart different style settings. countries of North America.
squares = [x**2 for x in x_values]
cubes = [x**3 for x in x_values] chart.add('Squares', squares, dots_size=2) from pygal.maps.world import World
chart.add('Cubes', cubes, dots_size=3)
chart_style = LightGreenStyle() wm = World()
chart = pygal.Line(style=chart_style) Styling individual data points wm.force_uri_protocol = 'http'
chart.force_uri_protocol = 'http' You can style individual data points as well. To do so, write a wm.title = 'North America'
chart.title = "Squares and Cubes" dictionary for each data point you want to customize. A 'value' wm.add('North America', ['ca', 'mx', 'us'])
chart.x_labels = x_values key is required, and other properties are optional.
import pygal wm.render_to_file('north_america.svg')
chart.add('Squares', squares)
chart.add('Cubes', cubes) Showing all the country codes
repos = [ In order to make maps, you need to know Pygal’s country codes.
chart.render_to_file('squares_cubes.svg') { The following example will print an alphabetical list of each country
'value': 20506, and its code.
Parametric built-in styles
Some built-in styles accept a custom color, then generate a theme 'color': '#3333CC',
'xlink': 'http://djangoproject.com/', from pygal.maps.world import COUNTRIES
based on that color.
},
from pygal.style import LightenStyle 20054, for code in sorted(COUNTRIES.keys()):
12607, print(code, COUNTRIES[code])
--snip-- 11827,
chart_style = LightenStyle('#336688')
Plotting numerical data on a world map
] To plot numerical data on a map, pass a dictionary to add()
chart = pygal.Line(style=chart_style) instead of a list.
--snip-- chart = pygal.Bar()
chart.force_uri_protocol = 'http' from pygal.maps.world import World
Customizing individual style properties
Style objects have a number of properties you can set individually. chart.x_labels = [
populations = {
'django', 'requests', 'scikit-learn',
chart_style = LightenStyle('#336688') 'ca': 34126000,
'tornado',
chart_style.plot_background = '#CCCCCC' ] 'us': 309349000,
chart_style.major_label_font_size = 20 chart.y_title = 'Stars' 'mx': 113423000,
chart_style.label_font_size = 16 chart.add('Python Repos', repos) }
--snip-- chart.render_to_file('python_repos.svg')
wm = World()
Custom style class wm.force_uri_protocol = 'http'
You can start with a bare style class, and then set only the wm.title = 'Population of North America'
properties you care about. wm.add('North America', populations)
chart_style = Style()
chart_style.colors = [ wm.render_to_file('na_populations.svg')
'#CCCCCC', '#AAAAAA', '#888888']
chart_style.plot_background = '#EEEEEE'

chart = pygal.Line(style=chart_style)
--snip--
59
The data in a Django project is structured as a set of Users interact with a project through web pages, and a
models. project’s home page can start out as a simple page with no
data. A page usually needs a URL, a view, and a template.
Defining a model
To define the models for your app, modify the file models.py that Mapping a project’s URLs
was created in your app’s folder. The __str__() method tells The project’s main urls.py file tells Django where to find the urls.py
Django how to represent data objects based on this model. files associated with each app in the project.
from django.db import models from django.conf.urls import include, url
from django.contrib import admin
Django is a web framework which helps you build class Topic(models.Model):
interactive websites using Python. With Django you """A topic the user is learning about.""" urlpatterns = [
define the kind of data your site needs to work with, text = models.CharField(max_length=200) url(r'^admin/', include(admin.site.urls)),
and you define the ways your users can work with date_added = models.DateTimeField( url(r'', include('learning_logs.urls',
that data. auto_now_add=True) namespace='learning_logs')),
]
def __str__(self):
return self.text Mapping an app’s URLs
It's usually best to install Django to a virtual environment,
where your project can be isolated from your other Python Activating a model the app. You’ll need to make this file yourself, and save it in the
projects. Most commands assume you’re working in an To use a model the app must be added to the tuple app’s folder.
active virtual environment. INSTALLED_APPS, which is stored in the project’s settings.py file.
from django.conf.urls import url
Create a virtual environment INSTALLED_APPS = (
--snip-- from . import views
$ python -m venv ll_env 'django.contrib.staticfiles',
Activate the environment (Linux and OS X) urlpatterns = [
# My apps url(r'^$', views.index, name='index'),
$ source ll_env/bin/activate 'learning_logs', ]
)
Activate the environment (Windows) Writing a simple view
Migrating the database A view takes information from a request and sends data to the
> ll_env\Scripts\activate browser, often through a template. View functions are stored in an
The database needs to be modified to store the kind of data that
the model represents. app’s views.py file. This simple view function doesn’t pull in any
Install Django to the active environment data, but it uses the template index.html to render the home page.
(ll_env)$ pip install Django $ python manage.py makemigrations learning_logs
$ python manage.py migrate from django.shortcuts import render

Creating a superuser def index(request):
A superuser is a user account that has access to all aspects of the """The home page for Learning Log."""
To start a project we’ll create a new project, create a project.
database, and start a development server. return render(request,
$ python manage.py createsuperuser 'learning_logs/index.html')
Create a new project
Registering a model
$ django-admin.py startproject learning_log . You can register your models with Django’s admin site, which
makes it easier to work with the data in your project. To do this,
Create a database modify the app’s admin.py file. View the admin site at The documentation for Django is available at
$ python manage.py migrate http://localhost:8000/admin/. http://docs.djangoproject.com/. The Django documentation
is thorough and user-friendly, so check it out!
View the project from django.contrib import admin
After issuing this command, you can view the project at
http://localhost:8000/. from learning_logs.models import Topic

$ python manage.py runserver admin.site.register(Topic)
Create a new app
A Django project is made up of one or more apps.
$ python manage.py startapp learning_logs
60
A new model can use an existing model. The ForeignKey
Writing a simple template attribute establishes a connection between instances of the Using data in a template
A template sets up the structure for a page. It’s a mix of html and The data in the view function’s context dictionary is available
two related models. Make sure to migrate the database
template code, which is like Python but not as powerful. Make a within the template. This data is accessed using template
folder called templates inside the project folder. Inside the after adding a new model to your app. variables, which are indicated by doubled curly braces.
templates folder make another folder with the same name as the The vertical line after a template variable indicates a filter. In this
app. This is where the template files should be saved.
Defining a model with a foreign key
case a filter called date formats date objects, and the filter
class Entry(models.Model): linebreaks renders paragraphs properly on a web page.
<p>Learning Log</p>
"""Learning log entries for a topic.""" {% extends 'learning_logs/base.html' %}
topic = models.ForeignKey(Topic)
<p>Learning Log helps you keep track of your
text = models.TextField() {% block content %}
learning, for any topic you're learning
date_added = models.DateTimeField(
about.</p>
auto_now_add=True) <p>Topic: {{ topic }}</p>
def __str__(self): <p>Entries:</p>
Many elements of a web page are repeated on every page return self.text[:50] + "..." <ul>
in the site, or every page in a section of the site. By writing {% for entry in entries %}
one parent template for the site, and one for each section, <li>
you can easily modify the look and feel of your entire site. <p>
Most pages in a project need to present data that’s specific
The parent template to the current user. {{ entry.date_added|date:'M d, Y H:i' }}
The parent template defines the elements common to a set of </p>
pages, and defines blocks that will be filled by individual pages. URL parameters <p>
A URL often needs to accept a parameter telling it which data to {{ entry.text|linebreaks }}
<p> access from the database. The second URL pattern shown here </p>
<a href="{% url 'learning_logs:index' %}"> looks for the ID of a specific topic and stores it in the parameter </li>
Learning Log topic_id.
{% empty %}
</a> urlpatterns = [ <li>There are no entries yet.</li>
</p> url(r'^$', views.index, name='index'), {% endfor %}
url(r'^topics/(?P<topic_id>\d+)/$', </ul>
{% block content %}{% endblock content %} views.topic, name='topic'),
The child template ] {% endblock content %}
The child template uses the {% extends %} template tag to pull in
the structure of the parent template. It then defines the content for
Using data in a view
The view uses a parameter from the URL to pull the correct data
any blocks defined in the parent template.
from the database. In this example the view is sending a context
{% extends 'learning_logs/base.html' %} dictionary to the template, containing data that should be displayed You can explore the data in your project from the command
on the page. line. This is helpful for developing queries and testing code
{% block content %} def topic(request, topic_id): snippets.
<p> """Show a topic and all its entries.""" Start a shell session
Learning Log helps you keep track topic = Topic.objects.get(id=topic_id)
of your learning, for any topic you're entries = topic.entry_set.order_by( $ python manage.py shell
learning about. '-date_added')
</p> Access data from the project
context = {
{% endblock content %} 'topic': topic, >>> from learning_logs.models import Topic
'entries': entries, >>> Topic.objects.all()
} [<Topic: Chess>, <Topic: Rock Climbing>]
return render(request, >>> topic = Topic.objects.get(id=1)
'learning_logs/topic.html', context) >>> topic.text
'Chess'
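Following a foreign key in the shell (sketch)
Because Entry has a foreign key to Topic, each topic also gets an entry_set you can query. A short sketch of the kinds of queries you might run; the results depend on your own data, so no output is shown.

>>> from learning_logs.models import Topic
>>> topic = Topic.objects.get(id=1)
>>> topic.entry_set.all()
>>> topic.entry_set.order_by('-date_added')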

Python code is usually indented by four spaces. In templates you'll often see two spaces used for indentation, because elements tend to be nested more deeply in templates.

If you make a change to your project and the change doesn't seem to have any effect, try restarting the server:

$ python manage.py runserver
61
Defining the URLs Showing the current login status
Users will need to be able to log in, log out, and register. Make a You can modify the base.html template to show whether the user is
new urls.py file in the users app folder. The login view is a default currently logged in, and to provide a link to the login and logout
view provided by Django. pages. Django makes a user object available to every template,
and this template takes advantage of this object.
from django.conf.urls import url The user.is_authenticated tag allows you to serve specific
from django.contrib.auth.views import login content to users depending on whether they have logged in or not.
The {{ user.username }} property allows you to greet users
from . import views who have logged in. Users who haven’t logged in see links to
register or log in.
urlpatterns = [ <p>
url(r'^login/$', login, <a href="{% url 'learning_logs:index' %}">
Most web applications need to let users create {'template_name': 'users/login.html'}, Learning Log
accounts. This lets users create and work with their name='login'), </a>
own data. Some of this data may be private, and url(r'^logout/$', views.logout_view, {% if user.is_authenticated %}
name='logout'), Hello, {{ user.username }}.
some may be public. Django’s forms allow users to
url(r'^register/$', views.register, <a href="{% url 'users:logout' %}">
enter and modify their data. name='register'), log out
] </a>
The login template {% else %}
User accounts are handled by a dedicated app called <a href="{% url 'users:register' %}">
users. Users need to be able to register, log in, and log The login view is provided by default, but you need to provide your
own login template. The template shown here displays a simple register
out. Django automates much of this work for you. login form, and provides basic error messages. Make a templates </a> -
Making a users app folder in the users folder, and then make a users folder in the <a href="{% url 'users:login' %}">
templates folder. Save this file as login.html. log in
After making the app, be sure to add 'users' to INSTALLED_APPS
The tag {% csrf_token %} helps prevent a common type of
in the project’s settings.py file. </a>
attack with forms. The {{ form.as_p }} element displays the
{% endif %}
$ python manage.py startapp users default login form in paragraph format. The <input> element
named next redirects the user to the home page after a successful </p>
Including URLS for the users app login.
Add a line to the project’s urls.py file so the users app’s URLs are {% block content %}{% endblock content %}
included in the project. {% extends "learning_logs/base.html" %}
The logout view
urlpatterns = [ {% block content %} The logout_view() function uses Django’s logout() function
url(r'^admin/', include(admin.site.urls)), {% if form.errors %} and then redirects the user back to the home page. Since there is
url(r'^users/', include('users.urls', no logout page, there is no logout template. Make sure to write this
<p>
namespace='users')), code in the views.py file that’s stored in the users app folder.
Your username and password didn't match.
url(r'', include('learning_logs.urls', Please try again. from django.http import HttpResponseRedirect
namespace='learning_logs')), </p> from django.core.urlresolvers import reverse
] {% endif %} from django.contrib.auth import logout

<form method="post" def logout_view(request):
action="{% url 'users:login' %}"> """Log the user out."""
{% csrf_token %} logout(request)
There are a number of ways to create forms and work with {{ form.as_p }} return HttpResponseRedirect(
them. You can use Django’s defaults, or completely <button name="submit">log in</button> reverse('learning_logs:index'))
customize your forms. For a simple way to let users enter
data based on your models, use a ModelForm. This creates <input type="hidden" name="next"
a form that allows users to enter data that will populate the value="{% url 'learning_logs:index' %}"/>
fields on a model. </form>
The register view on the back of this sheet shows a simple
approach to form processing. If the view doesn’t receive {% endblock content %}
data from a form, it responds with a blank form. If it
receives POST data from a form, it validates the data and
then saves it to the database.
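Defining a ModelForm (sketch)
The ModelForm mentioned above usually lives in a forms.py file inside the app. This is a minimal sketch based on the Topic model shown earlier; the blank label is just one common choice, not part of the original sheet.

from django import forms

from .models import Topic

class TopicForm(forms.ModelForm):
    class Meta:
        model = Topic
        fields = ['text']
        labels = {'text': ''}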
62
The register view The register template Restricting access to logged-in users
The register view needs to display a blank registration form when The register template displays the registration form in paragraph Some pages are only relevant to registered users. The views for
the page is first requested, and then process completed formats. these pages can be protected by the @login_required decorator.
registration forms. A successful registration logs the user in and Any view with this decorator will automatically redirect non-logged
redirects to the home page. {% extends 'learning_logs/base.html' %} in users to an appropriate page. Here’s an example views.py file.
from django.contrib.auth import login {% block content %} from django.contrib.auth.decorators import \
from django.contrib.auth import authenticate login_required
from django.contrib.auth.forms import \ <form method='post' --snip--
UserCreationForm action="{% url 'users:register' %}">
@login_required
def register(request): {% csrf_token %} def topic(request, topic_id):
"""Register a new user.""" {{ form.as_p }} """Show a topic and all its entries."""
if request.method != 'POST':
# Show blank registration form. Setting the redirect URL
<button name='submit'>register</button> The @login_required decorator sends unauthorized users to the
form = UserCreationForm() <input type='hidden' name='next' login page. Add the following line to your project’s settings.py file
else: value="{% url 'learning_logs:index' %}"/> so Django will know how to find your login page.
# Process completed form.
form = UserCreationForm( </form> LOGIN_URL = '/users/login/'
data=request.POST)
Preventing inadvertent access
{% endblock content %} Some pages serve data based on a parameter in the URL. You
if form.is_valid(): can check that the current user owns the requested data, and
new_user = form.save() return a 404 error if they don’t. Here’s an example view.
# Log in, redirect to home page.
pw = request.POST['password1'] Users will have data that belongs to them. Any model that should from django.http import Http404
authenticated_user = authenticate( be connected directly to a user needs a field connecting instances
of the model to a specific user. --snip--
username=new_user.username,
password=pw def topic(request, topic_id):
Making a topic belong to a user """Show a topic and all its entries."""
) Only the highest-level data in a hierarchy needs to be directly
login(request, authenticated_user) connected to a user. To do this import the User model, and add it topic = Topic.objects.get(id=topic_id)
return HttpResponseRedirect( as a foreign key on the data model. if topic.owner != request.user:
reverse('learning_logs:index')) After modifying the model you’ll need to migrate the database. raise Http404
You’ll need to choose a user ID to connect each existing instance --snip--
to.
context = {'form': form}
return render(request, from django.db import models
'users/register.html', context) from django.contrib.auth.models import User
If you provide some initial data, Django generates a form
class Topic(models.Model): with the user’s existing data. Users can then modify and
"""A topic the user is learning about.""" save their data.
The django-bootstrap3 app allows you to use the Bootstrap text = models.CharField(max_length=200)
library to make your project look visually appealing. The Creating a form with initial data
date_added = models.DateTimeField(
app provides tags that you can use in your templates to The instance parameter allows you to specify initial data for a form.
auto_now_add=True)
style individual elements on a page. Learn more at owner = models.ForeignKey(User) form = EntryForm(instance=entry)
http://django-bootstrap3.readthedocs.io/.
def __str__(self): Modifying data before saving
The argument commit=False allows you to make changes before
return self.text writing data to the database.
Heroku lets you push your project to a live server, making it Querying data for the current user new_topic = form.save(commit=False)
available to anyone with an internet connection. Heroku In a view, the request object has a user attribute. You can use this new_topic.owner = request.user
offers a free service level, which lets you learn the attribute to query for the user’s data. The filter() function then
new_topic.save()
deployment process without any commitment. You’ll need pulls the data that belongs to the current user.
to install a set of heroku tools, and use git to track the state topics = Topic.objects.filter(
of your project. See http://devcenter.heroku.com/, and click
owner=request.user)
on the Python link.
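Putting the pieces together: a new_topic view (sketch)
A hypothetical view that ties the ideas on this sheet together: it requires login, builds a ModelForm, attaches the current user before saving with commit=False, and redirects to the home page. The TopicForm import and the template name are assumptions, not part of the original sheet; the reverse import matches the Django version used elsewhere on this sheet.

from django.contrib.auth.decorators import login_required
from django.core.urlresolvers import reverse
from django.http import HttpResponseRedirect
from django.shortcuts import render

from .forms import TopicForm

@login_required
def new_topic(request):
    """Add a new topic for the current user."""
    if request.method != 'POST':
        # Show a blank form.
        form = TopicForm()
    else:
        # Process the completed form.
        form = TopicForm(data=request.POST)
        if form.is_valid():
            new_topic = form.save(commit=False)
            new_topic.owner = request.user
            new_topic.save()
            return HttpResponseRedirect(
                reverse('learning_logs:index'))

    context = {'form': form}
    return render(request,
        'learning_logs/new_topic.html', context)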
63
freeworld.posterous.com

Linux Bash Shell
Cheat Sheet

(works with about every distribution, except for apt-get which is Ubuntu/Debian exclusive)

Legend:

Everything in “<>” is to be replaced, ex: <fileName> --> iLovePeanuts.txt
Don't include the '=' in your commands
'..' means that more than one file can be affected with only one command ex: rm
file.txt file2.txt movie.mov .. ..
64
Linux Bash Shell Cheat Sheet
Basic Commands

Basic Terminal Shortcuts Basic file manipulation
CTRL L = Clear the terminal cat <fileName> = show content of file
CTRL D = Logout (less, more)
SHIFT Page Up/Down = Go up/down the terminal head = from the top
CTRL A = Cursor to start of line -n <#oflines> <fileName>
CTRL E = Cursor the end of line
CTRL U = Delete left of the cursor tail = from the bottom
CTRL K = Delete right of the cursor -n <#oflines> <fileName>
CTRL W = Delete word on the left
CTRL Y = Paste (after CTRL U,K or W) mkdir = create new folder
TAB = auto completion of file or command mkdir myStuff ..
CTRL R = reverse search history mkdir myStuff/pictures/ ..
!! = repeat last command
CTRL Z = stops the current command (resume with fg in foreground or bg in background) cp image.jpg newimage.jpg = copy and rename a file
Basic Terminal Navigation cp image.jpg <folderName>/ = copy to folder
cp image.jpg folder/sameImageNewName.jpg
ls -a = list all files and folders cp -R stuff otherStuff = copy and rename a folder
ls <folderName> = list files in folder cp *.txt stuff/ = copy all of *<file type> to folder
ls -lh = Detailed list, Human readable
ls -l *.jpg = list jpeg files only mv file.txt Documents/ = move file to a folder
ls -lh <fileName> = Result for file only mv <folderName> <folderName2> = move folder in folder
mv filename.txt filename2.txt = rename file
cd <folderName> = change directory mv <fileName> stuff/newfileName
if folder name has spaces use “ “ mv <folderName>/ .. = move folder up in hierarchy
cd / = go to root
cd .. = go up one folder, tip: ../../../ rm <fileName> .. = delete file (s)
rm -i <fileName> .. = ask for confirmation each file
du -h: Disk usage of folders, human readable rm -f <fileName> = force deletion of a file
du -ah: “ “ “ files & folders, Human readable rm -r <foldername>/ = delete folder
du -sh: only show disc usage of folders
touch <fileName> = create or update a file
pwd = print working directory
ln file1 file2 = physical link
man <command> = shows manual (RTFM) ln -s file1 file2 = symbolic link
65
Linux Bash Shell Cheat Sheet
Basic Commands

Researching Files Extract, sort and filter data
The slow method (sometimes very slow): grep <someText> <fileName> = search for text in file
-i = ignore case (case-insensitive search)
locate <text> = search the file name database for the text -I = exclude binary files
locate <fileName> = search for a file grep -r <text> <folderName>/ = search for file names
sudo updatedb = update database of files with occurrence of the text

find = the best file search tool (fast) With regular expressions:
find -name “<fileName>”
find -name “text*” = search for files that start with the word text grep -E ^<text> <fileName> = search start of lines
find -name “*text” = “ “ “ “ end “ “ “ “ with the word text
grep -E [0-4] <fileName> = shows lines containing numbers 0-4
grep -E [a-zA-Z] <fileName> = retrieve all lines
with alphabetical letters
Search from file Size (in ~)
find ~ -size +10M = search files bigger than.. (M,K,G) sort = sort the content of files
sort <fileName> = sort alphabetically
Search from last access sort -o <file> <outputFile> = write result to a file
find -name “<filetype>” -atime -5 sort -r <fileName> = sort in reverse
('-' = less than, '+' = more than and nothing = exactly) sort -R <fileName> = sort randomly
sort -n <fileName> = sort numbers
Search only files or directory’s
find -type d --> ex: find /var/log -name "syslog" -type d wc = word count
find -type f = files wc <fileName> = nbr of line, nbr of words, byte size
-l (lines), -w (words), -c (byte size), -m
More info: man find, man locate (number of characters)

cut = cut a part of a file
-c --> ex: cut -c 2-5 names.txt
(cut the characters 2 to 5 of each line)
-d (delimiter) (-d & -f good for .csv files)
-f (# of field to cut)

more info: man cut, man sort, man grep
66
Linux Bash Shell Cheat Sheet
Basic Commands

Time settings (continued)
date = view & modify time (on your computer) crontab = execute a command regularly
-e = modify the crontab
View: -l = view current crontab
date “+%H” --> If it's 9 am, then it will show 09 -r = delete you crontab
date “+%H:%M:%Ss” = (hours, minutes, seconds) In crontab the syntax is
%Y = years <Minutes> <Hours> <Day of month> <Day of week (0-6,
Modify: 0 = Sunday)> <COMMAND>
MMDDhhmmYYYY
Month | Day | Hours | Minutes | Year ex, create the file movies.txt every day at 15:47:
47 15 * * * touch /home/bob/movies.txt
sudo date 031423421997 = March 14th 1997, 23:42 * * * * * --> every minute
at 5:30 in the morning, from the 1st to 15th each month:
Execute programs at another time 30 5 1-15 * *
at midnight on Mondays, Wednesdays and Thursdays:
use 'at' to execute programs in the future 0 0 * * 1,3,4
every two hours:
Step 1, write in the terminal: at <timeOfExecution> ENTER 0 */2 * * *
ex --> at 16:45 or at 13:43 7/23/11 (to be more precise) every 10 minutes Monday to Friday:
or after a certain delay: */10 * * * 1-5
at now +5 minutes (hours, days, weeks, months, years)
Step 2: <ENTER COMMAND> ENTER Execute programs in the background
repeat step 2 as many times you need
Step 3: CTRL D to close input Add a '&' at the end of a command
ex --> cp bigMovieFile.mp4 &
atq = show a list of jobs waiting to be executed
nohup: ignores the HUP signal when closing the console
atrm = delete a job n°<x> (process will still run if the terminal is closed)
ex (delete job #42) --> atrm 42 ex --> nohup cp bigMovieFile.mp4

sleep = pause between commands jobs = know what is running in the background
with ';' you can chain commands, ex: touch file; rm file
you can make a pause between commands (minutes, hours, days) fg = put a background process to foreground
ex --> touch file; sleep 10; rm file <-- 10 seconds ex: fg %1 (process 1), fg %2 (process 2), fg %3, ...
67
Linux Bash Shell Cheat Sheet
Basic Commands

Process Management Create and modify user accounts
w = who is logged on and what they are doing sudo adduser bob = root creates new user
sudo passwd <AccountName> = change a user's password
tload = graphic representation of system load average sudo deluser <AccountName> = delete an account
(quit with CTRL C)
addgroup friends = create a new user group
ps = Static process list delgroup friends = delete a user group
-ef --> ex: ps -ef | less
-ejH --> show process hierarchy usermod -g friends <Account> = add user to a group
-u --> processes from the current user usermod -l <newName> <oldName> = change account name
usermod -aG friends bob = add groups to a user with-
top = Dynamic process list out losing the ones he's already in
While in top:
• q to close top File Permissions
• h to show the help
• k to kill a process chown = change the owner of a file
ex --> chown bob hello.txt
CTRL C to top a current terminal process chown user:bob report.txt = changes the user owning
report.txt to 'user' and the group owning it to 'bob'
kill = kill a process -R = recursively affect all the sub folders
You need the PID # of the process ex --> chown -R bob:bob /home/Daniel
ps -u <AccountName> | grep <Application>
Then chmod = modify user access/permission – simple way
kill <PID> .. .. .. u = user
kill -9 <PID> = violent kill g = group
o = other
killall = kill multiple process's
ex --> killall locate d = directory (if element is a directory)
l = link (if element is a file link)
extras: r = read (read permissions)
sudo halt <-- to close computer w = write (write permissions)
sudo reboot <-- to reboot x = eXecute (only useful for scripts and
programs)
File Permissions (continued)

'+' means add a right
'-' means delete a right
'=' means affect a right

ex --> chmod g+w someFile.txt
(add to the current group the right to modify someFile.txt)

more info: man chmod

Flow redirection

Redirect results of commands:
'>' at the end of a command to redirect the result to a file
ex --> ps -ejH > process.txt
'>>' to redirect the result to the end of a file

Redirect errors:
'2>' at the end of the command to redirect the errors to a file
ex --> cut -d , -f 1 file.csv > file 2> errors.log
'2>&1' to redirect the errors the same way as the standard output

Read progressively from the keyboard

<Command> << <wordToTerminateInput>
ex --> sort << END   <-- 'END' can be anything you want
> Hello
> Alex
> Cinema
> Game
> Code
> Ubuntu
> END

terminal output:
Alex
Cinema
Code
Game
Hello
Ubuntu

Another example --> wc -m << END

Chain commands

'|' at the end of a command to enter another one
ex --> du | sort -nr | less

Archive and compress data

Archive and compress data the long way:

Step 1, put all the files you want to compress in the same folder:
ex --> mv *.txt folder/

Step 2, create the tar file:
tar -cvf my_archive.tar folder/
-c : creates a .tar archive
-v : tells you what is happening (verbose)
-f : assembles the archive into one file

Step 3.1, create a gzip file (most common):
gzip my_archive.tar
to decompress: gunzip my_archive.tar.gz

Step 3.2, or create a bzip2 file (more powerful but slower):
bzip2 my_archive.tar
to decompress: bunzip2 my_archive.tar.bz2
Archive and compress data (continued)

Step 4, to decompress the .tar file:
tar -xvf my_archive.tar

Archive and compress data the fast way:

gzip: tar -zcvf my_archive.tar.gz folder/
decompress: tar -zxvf my_archive.tar.gz

bzip2: tar -jcvf my_archive.tar.bz2 folder/
decompress: tar -jxvf my_archive.tar.bz2

Show the content of .tar, .gz or .bz2 without decompressing it:
gzip: tar -ztf my_archive.tar.gz
bzip2: tar -jtf my_archive.tar.bz2
tar: tar -tf my_archive.tar

tar extra:
tar -rvf my_archive.tar file.txt = add a file to the .tar

You can also directly compress a single file and view the file without decompressing:

Step 1, use gzip or bzip2 to compress the file:
gzip numbers.txt

Step 2, view the file without decompressing it:
zcat = view the entire file in the console (same as cat)
zmore = view one screen at a time (same as more)
zless = view one line of the file at a time (same as less)

Installing software

When software is available in the repositories:
sudo apt-get install <nameOfSoftware>
ex --> sudo apt-get install aptitude

If you download it from the Internet in .tar.gz (or .bz2) format - "Compiling from source":

Step 1, create a folder to place the file:
mkdir /home/username/src <-- then cd to it
Step 2, with 'ls' verify that the file is there
(if not, mv ../file.tar.gz /home/username/src/)
Step 3, decompress the file (if .zip: unzip <file>)
Step 4, use 'ls', you should see a new directory
Step 5, cd to the new directory
Step 6.1, use ls to verify you have an INSTALL file, then: more INSTALL
If you don't have an INSTALL file:
Step 6.2, execute ./configure <-- creates a makefile
Step 6.2.1, run make <-- builds application binaries
Step 6.2.2, switch to root --> su
Step 6.2.3, make install <-- installs the software
Step 7, read the readme file
DVC Cheat Sheet
https://github.com/iterative/dvc
https://dvc.org/chat

Basics

Initializing
Initialize a DVC environment
$ dvc init

Remote
Set up a remote to keep and share data files
$ dvc remote add -d myremote /path
*Possible remotes include local, s3, gs, azure, ssh, hdfs and http.

Show all available remotes
$ dvc remote list

Modify remote settings
$ dvc remote modify myremote
*Use if remote requires extra configuration

Adding Files
Add files under DVC control
$ dvc add filename
*Use --no-commit to stop adding the file to the cache.

Share Data
Push all data files to the remote storage
$ dvc push

Push outputs of a specific .dvc file
$ dvc push filename.dvc

Retrieve Data
Download files from the remote storage
$ dvc pull

Download files from a specific .dvc file
$ dvc pull filename.dvc

The Pipeline
Add transformations and generate a stage file from a given command
$ dvc run -d dependencyfile \
    -o outputfile python command.py
*Use --file to specify the name of the generated .dvc file.
*Use --metrics to output a file containing the metric.

Reproducing
Reproduce outputs defined in a .dvc file
$ dvc repro filename.dvc
*Name a .dvc file "Dvcfile" to be used by dvc repro by default

Metrics
Collect and display project metrics
$ dvc metrics show
*Use --all to show the metrics in all branches.

Visualizing
Show stages in a pipeline
$ dvc pipeline show --ascii file.dvc
*Add --commands or --outs to show more detail.

Show connected pipelines of DVC stages
$ dvc pipeline list

Show changed stages in the pipeline
$ dvc status

Other Commands
Checkout files from cache into working space
$ dvc checkout

Set/unset cache directory location
$ dvc cache dir /path

Commit outputs to cache
$ dvc commit
*Use if you specified --no-commit in dvc add/run/repro

Config repository or global options
$ dvc config
*Config the default remote using core.remote myremote
*Config core (loglevel, remote), cache and state settings

Fetch files from the remote to the local cache
$ dvc fetch file.dvc

Remove unused objects from cache
$ dvc gc

Import file from URL to local directory
$ dvc import url /path
*Supported schemes include local, s3, gs, azure, ssh, hdfs and http.

Remove data files tracked by dvc
$ dvc remove filename.dvc

Made by Carl Handlin based on the documentation for DVC at https://dvc.org/doc
Git Cheat Sheet

Create a Repository

From scratch -- Create a new local repository
$ git init [project name]

Download from an existing repository
$ git clone my_url

Observe your Repository

List new or modified files not yet committed
$ git status

Show the changes to files not yet staged
$ git diff

Show the changes to staged files
$ git diff --cached

Show all staged and unstaged file changes
$ git diff HEAD

Show the changes between two commit ids
$ git diff commit1 commit2

List the change dates and authors for a file
$ git blame [file]

Show the file changes for a commit id and/or file
$ git show [commit]:[file]

Show full change history
$ git log

Show change history for file/directory including diffs
$ git log -p [file/directory]

Working with Branches

List all local branches
$ git branch

List all branches, local and remote
$ git branch -av

Switch to a branch, my_branch, and update working directory
$ git checkout my_branch

Create a new branch called new_branch
$ git branch new_branch

Delete the branch called my_branch
$ git branch -d my_branch

Merge branch_a into branch_b
$ git checkout branch_b
$ git merge branch_a

Tag the current commit
$ git tag my_tag

Make a change

Stage the file, ready for commit
$ git add [file]

Stage all changed files, ready for commit
$ git add .

Commit all staged files to versioned history
$ git commit -m "commit message"

Commit all your tracked files to versioned history
$ git commit -am "commit message"

Unstage the file, keeping the file changes
$ git reset [file]

Revert everything to the last commit
$ git reset --hard

Synchronize

Get the latest changes from origin (no merge)
$ git fetch

Fetch the latest changes from origin and merge
$ git pull

Fetch the latest changes from origin and rebase
$ git pull --rebase

Push local changes to the origin
$ git push

Finally!

When in doubt, use git help
$ git command --help

Or visit https://training.github.com/ for official GitHub training.
Git Cheat Sheet

Git Basics

git init <directory> -- Create empty Git repo in specified directory. Run with no arguments to initialize the current directory as a git repository.
git clone <repo> -- Clone repo located at <repo> onto local machine. Original repo can be located on the local filesystem or on a remote machine via HTTP or SSH.
git config user.name <name> -- Define author name to be used for all commits in current repo. Devs commonly use --global flag to set config options for current user.
git add <directory> -- Stage all changes in <directory> for the next commit. Replace <directory> with a <file> to change a specific file.
git commit -m "<message>" -- Commit the staged snapshot, but instead of launching a text editor, use <message> as the commit message.
git status -- List which files are staged, unstaged, and untracked.
git log -- Display the entire commit history using the default format. For customization see additional options.
git diff -- Show unstaged changes between your index and working directory.

Undoing Changes

git revert <commit> -- Create new commit that undoes all of the changes made in <commit>, then apply it to the current branch.
git reset <file> -- Remove <file> from the staging area, but leave the working directory unchanged. This unstages a file without overwriting any changes.
git clean -n -- Shows which files would be removed from working directory. Use the -f flag in place of the -n flag to execute the clean.

Rewriting Git History

git commit --amend -- Replace the last commit with the staged changes and last commit combined. Use with nothing staged to edit the last commit's message.
git rebase <base> -- Rebase the current branch onto <base>. <base> can be a commit ID, a branch name, a tag, or a relative reference to HEAD.
git reflog -- Show a log of changes to the local repository's HEAD. Add --relative-date flag to show date info or --all to show all refs.

Git Branches

git branch -- List all of the branches in your repo. Add a <branch> argument to create a new branch with the name <branch>.
git checkout -b <branch> -- Create and check out a new branch named <branch>. Drop the -b flag to checkout an existing branch.
git merge <branch> -- Merge <branch> into the current branch.

Remote Repositories

git remote add <name> <url> -- Create a new connection to a remote repo. After adding a remote, you can use <name> as a shortcut for <url> in other commands.
git fetch <remote> <branch> -- Fetches a specific <branch> from the repo. Leave off <branch> to fetch all remote refs.
git pull <remote> -- Fetch the specified remote's copy of current branch and immediately merge it into the local copy.
git push <remote> <branch> -- Push the branch to <remote>, along with necessary commits and objects. Creates named branch in the remote repo if it doesn't exist.

Additional Options

git config

git config --global user.name <name> -- Define the author name to be used for all commits by the current user.
git config --global user.email <email> -- Define the author email to be used for all commits by the current user.
git config --global alias.<alias-name> <git-command> -- Create shortcut for a Git command. E.g. alias.glog "log --graph --oneline" will set git glog equivalent to git log --graph --oneline.
git config --system core.editor <editor> -- Set text editor used by commands for all users on the machine. <editor> arg should be the command that launches the desired editor (e.g., vi).
git config --global --edit -- Open the global configuration file in a text editor for manual editing.

git log

git log -<limit> -- Limit number of commits by <limit>. E.g. git log -5 will limit to 5 commits.
git log --oneline -- Condense each commit to a single line.
git log -p -- Display the full diff of each commit.
git log --stat -- Include which files were altered and the relative number of lines that were added or deleted from each of them.
git log --author="<pattern>" -- Search for commits by a particular author.
git log --grep="<pattern>" -- Search for commits with a commit message that matches <pattern>.
git log <since>..<until> -- Show commits that occur between <since> and <until>. Args can be a commit ID, branch name, HEAD, or any other kind of revision reference.
git log -- <file> -- Only display commits that have the specified file.
git log --graph --decorate -- --graph flag draws a text based graph of commits on left side of commit msgs. --decorate adds names of branches or tags of commits shown.

git diff

git diff HEAD -- Show difference between working directory and last commit.
git diff --cached -- Show difference between staged changes and last commit.

git reset

git reset -- Reset staging area to match most recent commit, but leave the working directory unchanged.
git reset --hard -- Reset staging area and working directory to match most recent commit and overwrite all changes in the working directory.
git reset <commit> -- Move the current branch tip backward to <commit>, reset the staging area to match, but leave the working directory alone.
git reset --hard <commit> -- Same as previous, but resets both the staging area & working directory to match. Deletes uncommitted changes, and all commits after <commit>.

git rebase

git rebase -i <base> -- Interactively rebase current branch onto <base>. Launches editor to enter commands for how each commit will be transferred to the new base.

git pull

git pull --rebase <remote> -- Fetch the remote's copy of current branch and rebase it into the local copy. Uses git rebase instead of merge to integrate the branches.

git push

git push <remote> --force -- Forces the git push even if it results in a non-fast-forward merge. Do not use the --force flag unless you're absolutely sure you know what you're doing.
git push <remote> --all -- Push all of your local branches to the specified remote.
git push <remote> --tags -- Tags aren't automatically pushed when you push a branch or use the --all flag. The --tags flag sends all of your local tags to the remote repo.

Visit atlassian.com/git for more information, training, and tutorials
SQL cheat sheet

Basic Queries

-- filter your columns
SELECT col1, col2, col3, ... FROM table1
-- filter the rows
WHERE col4 = 1 AND col5 = 2
-- aggregate the data
GROUP BY …
-- limit aggregated data
HAVING count(*) > 1
-- order of the results
ORDER BY col2

Useful keywords for SELECTs:
DISTINCT - return unique results
BETWEEN a AND b - limit the range; the values can be numbers, text, or dates
LIKE - pattern search within the column text
IN (a, b, c) - check if the value is contained among given

Data Modification

-- update specific data with the WHERE clause
UPDATE table1 SET col1 = 1 WHERE col2 = 2
-- insert values manually
INSERT INTO table1 (ID, FIRST_NAME, LAST_NAME)
VALUES (1, 'Rebel', 'Labs');
-- or by using the results of a query
INSERT INTO table1 (ID, FIRST_NAME, LAST_NAME)
SELECT id, last_name, first_name FROM table2

Views

A VIEW is a virtual table, which is a result of a query. They can be used to create virtual tables of complex queries.
CREATE VIEW view1 AS
SELECT col1, col2
FROM table1
WHERE …

The Joy of JOINs

LEFT OUTER JOIN - all rows from table A, even if they do not exist in table B
INNER JOIN - fetch the results that exist in both tables
RIGHT OUTER JOIN - all rows from table B, even if they do not exist in table A

Updates on JOINed Queries
You can use JOINs in your UPDATEs:
UPDATE t1 SET a = 1
FROM table1 t1 JOIN table2 t2 ON t1.id = t2.t1_id
WHERE t1.col1 = 0 AND t2.col2 IS NULL;
NB! Use database specific syntax, it might be faster!

Semi JOINs
You can use subqueries instead of JOINs:
SELECT col1, col2 FROM table1 WHERE id IN
(SELECT t1_id FROM table2 WHERE date > CURRENT_TIMESTAMP)

Useful Utility Functions
-- convert strings to dates:
TO_DATE (Oracle, PostgreSQL), STR_TO_DATE (MySQL)
-- return the first non-NULL argument:
COALESCE (col1, col2, "default value")
-- return current time:
CURRENT_TIMESTAMP
-- compute set operations on two result sets
SELECT col1, col2 FROM table1
UNION / EXCEPT / INTERSECT
SELECT col3, col4 FROM table2;

UNION - returns data from both queries
EXCEPT - rows from the first query that are not present in the second query
INTERSECT - rows that are returned from both queries

Indexes
If you query by a column, index it!
CREATE INDEX index1 ON table1 (col1)
Don't forget:
Avoid overlapping indexes
Avoid indexing on too many columns
Indexes can speed up DELETE and UPDATE operations

Reporting
Use aggregation functions:
COUNT - return the number of rows
SUM - cumulate the values
AVG - return the average for the group
MIN / MAX - smallest / largest value
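A quick way to see the JOIN behaviours described above is to run them against an in-memory SQLite database from Python, the main language of this document. The sketch below uses two small, made-up tables (customer and orders); only the JOIN semantics are the point.

import sqlite3

# In-memory database with two tiny, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders   (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cho');
    INSERT INTO orders   VALUES (10, 1, 50), (11, 1, 20), (12, 2, 99);
""")

# INNER JOIN: only customers that have at least one matching order (Cho is dropped).
inner = conn.execute("""
    SELECT c.name, o.amount
    FROM customer c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

# LEFT OUTER JOIN: every customer; amount is NULL (None) where no order exists.
left = conn.execute("""
    SELECT c.name, o.amount
    FROM customer c
    LEFT OUTER JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

print(inner)   # Ada and Bob rows only
print(left)    # Ada, Bob, and ('Cho', None)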
SQL CHEAT SHEET (http://www.sqltutorial.org)

QUERYING DATA FROM A TABLE

SELECT c1, c2 FROM t;
Query data in columns c1, c2 from a table

SELECT * FROM t;
Query all rows and columns from a table

SELECT c1, c2 FROM t
WHERE condition;
Query data and filter rows with a condition

SELECT DISTINCT c1 FROM t
WHERE condition;
Query distinct rows from a table

SELECT c1, c2 FROM t
ORDER BY c1 ASC [DESC];
Sort the result set in ascending or descending order

SELECT c1, c2 FROM t
ORDER BY c1
LIMIT n OFFSET offset;
Skip offset rows and return the next n rows

SELECT c1, aggregate(c2)
FROM t
GROUP BY c1;
Group rows using an aggregate function

SELECT c1, aggregate(c2)
FROM t
GROUP BY c1
HAVING condition;
Filter groups using HAVING clause

QUERYING FROM MULTIPLE TABLES

SELECT c1, c2
FROM t1
INNER JOIN t2 ON condition;
Inner join t1 and t2

SELECT c1, c2
FROM t1
LEFT JOIN t2 ON condition;
Left join t1 and t2

SELECT c1, c2
FROM t1
RIGHT JOIN t2 ON condition;
Right join t1 and t2

SELECT c1, c2
FROM t1
FULL OUTER JOIN t2 ON condition;
Perform full outer join

SELECT c1, c2
FROM t1
CROSS JOIN t2;
Produce a Cartesian product of rows in tables

SELECT c1, c2
FROM t1, t2;
Another way to perform cross join

SELECT c1, c2
FROM t1 A
INNER JOIN t1 B ON condition;
Join t1 to itself using INNER JOIN clause

USING SQL OPERATORS

SELECT c1, c2 FROM t1
UNION [ALL]
SELECT c1, c2 FROM t2;
Combine rows from two queries

SELECT c1, c2 FROM t1
INTERSECT
SELECT c1, c2 FROM t2;
Return the intersection of two queries

SELECT c1, c2 FROM t1
MINUS
SELECT c1, c2 FROM t2;
Subtract a result set from another result set

SELECT c1, c2 FROM t
WHERE c1 [NOT] LIKE pattern;
Query rows using pattern matching %, _

SELECT c1, c2 FROM t
WHERE c1 [NOT] IN value_list;
Query rows in a list

SELECT c1, c2 FROM t
WHERE c1 BETWEEN low AND high;
Query rows between two values

SELECT c1, c2 FROM t
WHERE c1 IS [NOT] NULL;
Check if values in a table are NULL or not
MANAGING TABLES

CREATE TABLE t (
  id INT PRIMARY KEY,
  name VARCHAR NOT NULL,
  price INT DEFAULT 0
);
Create a new table with three columns

DROP TABLE t;
Delete the table from the database

ALTER TABLE t ADD column;
Add a new column to the table

ALTER TABLE t DROP COLUMN c;
Drop column c from the table

ALTER TABLE t ADD constraint;
Add a constraint

ALTER TABLE t DROP constraint;
Drop a constraint

ALTER TABLE t1 RENAME TO t2;
Rename a table from t1 to t2

ALTER TABLE t1 RENAME c1 TO c2;
Rename column c1 to c2

TRUNCATE TABLE t;
Remove all data in a table

USING SQL CONSTRAINTS

CREATE TABLE t(
  c1 INT, c2 INT, c3 VARCHAR,
  PRIMARY KEY (c1, c2)
);
Set c1 and c2 as a primary key

CREATE TABLE t1(
  c1 INT PRIMARY KEY,
  c2 INT,
  FOREIGN KEY (c2) REFERENCES t2(c2)
);
Set c2 column as a foreign key

CREATE TABLE t(
  c1 INT, c2 INT,
  UNIQUE(c1, c2)
);
Make the values in c1 and c2 unique

CREATE TABLE t(
  c1 INT, c2 INT,
  CHECK(c1 > 0 AND c1 >= c2)
);
Ensure c1 > 0 and values in c1 >= c2

CREATE TABLE t(
  c1 INT PRIMARY KEY,
  c2 VARCHAR NOT NULL
);
Set values in c2 column not NULL

MODIFYING DATA

INSERT INTO t(column_list)
VALUES(value_list);
Insert one row into a table

INSERT INTO t(column_list)
VALUES (value_list),
       (value_list), …;
Insert multiple rows into a table

INSERT INTO t1(column_list)
SELECT column_list
FROM t2;
Insert rows from t2 into t1

UPDATE t
SET c1 = new_value;
Update new value in the column c1 for all rows

UPDATE t
SET c1 = new_value,
    c2 = new_value
WHERE condition;
Update values in the columns c1, c2 that match the condition

DELETE FROM t;
Delete all data in a table

DELETE FROM t
WHERE condition;
Delete a subset of rows in a table
MANAGING VIEWS

CREATE VIEW v(c1,c2)
AS
SELECT c1, c2
FROM t;
Create a new view that consists of c1 and c2

CREATE VIEW v(c1,c2)
AS
SELECT c1, c2
FROM t
WITH [CASCADED | LOCAL] CHECK OPTION;
Create a new view with check option

CREATE RECURSIVE VIEW v
AS
select-statement -- anchor part
UNION [ALL]
select-statement; -- recursive part
Create a recursive view

CREATE TEMPORARY VIEW v
AS
SELECT c1, c2
FROM t;
Create a temporary view

DROP VIEW view_name;
Delete a view

MANAGING INDEXES

CREATE INDEX idx_name
ON t(c1,c2);
Create an index on c1 and c2 of the table t

CREATE UNIQUE INDEX idx_name
ON t(c3,c4);
Create a unique index on c3, c4 of the table t

DROP INDEX idx_name;
Drop an index

SQL AGGREGATE FUNCTIONS

AVG   returns the average of a list
COUNT returns the number of elements of a list
SUM   returns the total of a list
MAX   returns the maximum value in a list
MIN   returns the minimum value in a list

MANAGING TRIGGERS

CREATE OR MODIFY TRIGGER trigger_name
WHEN EVENT
ON table_name TRIGGER_TYPE
EXECUTE stored_procedure;
Create or modify a trigger

WHEN
• BEFORE – invoke before the event occurs
• AFTER – invoke after the event occurs

EVENT
• INSERT – invoke for INSERT
• UPDATE – invoke for UPDATE
• DELETE – invoke for DELETE

TRIGGER_TYPE
• FOR EACH ROW
• FOR EACH STATEMENT

CREATE TRIGGER before_insert_person
BEFORE INSERT
ON person FOR EACH ROW
EXECUTE stored_procedure;
Create a trigger invoked before a new row is inserted into the person table

DROP TRIGGER trigger_name;
Delete a specific trigger
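Triggers can also be tried out locally from Python with SQLite. Note this is only a sketch: SQLite has no stored procedures, so instead of EXECUTE stored_procedure the trigger body is inline SQL, and the person/audit tables are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE audit  (msg TEXT);

    -- AFTER INSERT, FOR EACH ROW: log every new person into the audit table.
    CREATE TRIGGER log_person_insert
    AFTER INSERT ON person
    FOR EACH ROW
    BEGIN
        INSERT INTO audit(msg) VALUES ('inserted ' || NEW.name);
    END;
""")

conn.execute("INSERT INTO person(name) VALUES ('Ada')")
print(conn.execute("SELECT msg FROM audit").fetchall())   # [('inserted Ada',)]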
CS 229 – Machine Learning    https://stanford.edu/~shervine

VIP Refresher: Probabilities and Statistics
Afshine Amidi and Shervine Amidi
August 6, 2018

Introduction to Probability and Combinatorics

• Sample space – The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.

• Event – Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.

• Axioms of probability – For each event E, we denote P(E) as the probability of event E occurring. By noting E1, ..., En mutually exclusive events, we have the 3 following axioms:
(1) 0 ≤ P(E) ≤ 1    (2) P(S) = 1    (3) P(∪_{i=1}^n Ei) = Σ_{i=1}^n P(Ei)

• Permutation – A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n, r), defined as:
P(n, r) = n! / (n − r)!

• Combination – A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n, r), defined as:
C(n, r) = P(n, r) / r! = n! / (r! (n − r)!)
Remark: we note that for 0 ≤ r ≤ n, we have P(n, r) ≥ C(n, r).

Conditional Probability

• Bayes' rule – For events A and B such that P(B) > 0, we have:
P(A|B) = P(B|A) P(A) / P(B)
Remark: we have P(A ∩ B) = P(A) P(B|A) = P(A|B) P(B).

• Partition – Let {Ai, i ∈ [[1, n]]} be such that for all i, Ai ≠ ∅. We say that {Ai} is a partition if we have:
∀ i ≠ j, Ai ∩ Aj = ∅   and   ∪_{i=1}^n Ai = S
Remark: for any event B in the sample space, we have P(B) = Σ_{i=1}^n P(B|Ai) P(Ai).

• Extended form of Bayes' rule – Let {Ai, i ∈ [[1, n]]} be a partition of the sample space. We have:
P(Ak|B) = P(B|Ak) P(Ak) / Σ_{i=1}^n P(B|Ai) P(Ai)

• Independence – Two events A and B are independent if and only if we have:
P(A ∩ B) = P(A) P(B)

Random Variables

• Random variable – A random variable, often noted X, is a function that maps every element in a sample space to a real line.

• Cumulative distribution function (CDF) – The cumulative distribution function F, which is monotonically non-decreasing and is such that lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1, is defined as:
F(x) = P(X ≤ x)
Remark: we have P(a < X ≤ b) = F(b) − F(a).

• Probability density function (PDF) – The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.

• Relationships involving the PDF and CDF – Here are the important properties to know in the discrete (D) and the continuous (C) cases:

Case   CDF F                              PDF f                   Properties of PDF
(D)    F(x) = Σ_{xi ≤ x} P(X = xi)        f(xj) = P(X = xj)       0 ≤ f(xj) ≤ 1 and Σ_j f(xj) = 1
(C)    F(x) = ∫_{−∞}^{x} f(y) dy          f(x) = dF/dx            f(x) ≥ 0 and ∫_{−∞}^{+∞} f(x) dx = 1

• Variance – The variance of a random variable, often noted Var(X) or σ², is a measure of the spread of its distribution function. It is determined as follows:
Var(X) = E[(X − E[X])²] = E[X²] − E[X]²

• Standard deviation – The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:
σ = √Var(X)
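The counting formulas and Bayes' rule above are easy to sanity-check in Python with the standard library (math.perm and math.comb need Python ≥ 3.8). The diagnostic-test numbers below are made up purely for illustration.

from math import comb, perm, factorial

# P(n, r) = n!/(n-r)!  and  C(n, r) = P(n, r)/r!
n, r = 6, 3
assert perm(n, r) == factorial(n) // factorial(n - r) == 120
assert comb(n, r) == perm(n, r) // factorial(r) == 20

# Extended Bayes' rule with the partition {disease, no disease}:
# prevalence P(D), sensitivity P(+|D), false-positive rate P(+|not D).
p_d, p_pos_d, p_pos_nd = 0.01, 0.95, 0.05
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)   # total probability of a positive test
p_d_pos = p_pos_d * p_d / p_pos                # Bayes' rule: P(D | +)
print(round(p_d_pos, 3))                       # ≈ 0.161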
• Expectation and moments of the distribution – Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[X^k] and characteristic function ψ(ω) for the discrete and continuous cases:

Case   E[X]                   E[g(X)]                  E[X^k]                  ψ(ω)
(D)    Σ_{i=1}^n xi f(xi)     Σ_{i=1}^n g(xi) f(xi)    Σ_{i=1}^n xi^k f(xi)    Σ_{i=1}^n f(xi) e^{iωxi}
(C)    ∫ x f(x) dx            ∫ g(x) f(x) dx           ∫ x^k f(x) dx           ∫ f(x) e^{iωx} dx
(all integrals taken over (−∞, +∞))
Remark: we have e^{iωx} = cos(ωx) + i sin(ωx).

• Revisiting the kth moment – The kth moment can also be computed with the characteristic function as follows:
E[X^k] = (1 / i^k) [∂^k ψ / ∂ω^k] evaluated at ω = 0

• Transformation of random variables – Let the variables X and Y be linked by some function. By noting f_X and f_Y the distribution functions of X and Y respectively, we have:
f_Y(y) = f_X(x) |dx/dy|

• Leibniz integral rule – Let g be a function of x and potentially c, and a, b boundaries that may depend on c. We have:
∂/∂c ∫_a^b g(x) dx = (∂b/∂c)·g(b) − (∂a/∂c)·g(a) + ∫_a^b (∂g/∂c)(x) dx

• Chebyshev's inequality – Let X be a random variable with expected value µ and standard deviation σ. For k, σ > 0, we have the following inequality:
P(|X − µ| ≥ kσ) ≤ 1/k²

Jointly Distributed Random Variables

• Conditional density – The conditional density of X with respect to Y, often noted f_{X|Y}, is defined as follows:
f_{X|Y}(x) = f_{XY}(x, y) / f_Y(y)

• Independence – Two random variables X and Y are said to be independent if we have:
f_{XY}(x, y) = f_X(x) f_Y(y)

• Marginal density and cumulative distribution – From the joint density probability function f_{XY}, we have:

Case   Marginal density                          Cumulative function
(D)    f_X(xi) = Σ_j f_{XY}(xi, yj)              F_{XY}(x, y) = Σ_{xi ≤ x} Σ_{yj ≤ y} f_{XY}(xi, yj)
(C)    f_X(x) = ∫_{−∞}^{+∞} f_{XY}(x, y) dy      F_{XY}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{XY}(x′, y′) dx′ dy′

• Distribution of a sum of independent random variables – Let Y = X1 + ... + Xn with X1, ..., Xn independent. We have:
ψ_Y(ω) = Π_{k=1}^n ψ_{Xk}(ω)

• Covariance – We define the covariance of two random variables X and Y, that we note σ²_{XY} or more commonly Cov(X, Y), as follows:
Cov(X, Y) = σ²_{XY} = E[(X − µX)(Y − µY)] = E[XY] − µX µY

• Correlation – By noting σX, σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρ_{XY}, as follows:
ρ_{XY} = σ²_{XY} / (σX σY)
Remarks: For any X, Y, we have ρ_{XY} ∈ [−1, 1]. If X and Y are independent, then ρ_{XY} = 0.

• Main distributions – Here are the main distributions to have in mind:

Type  Distribution              PDF                                                ψ(ω)                                  E[X]       Var(X)
(D)   Binomial, X ∼ B(n, p)     P(X = x) = C(n, x) p^x q^{n−x},  x ∈ [[0, n]]      (p e^{iω} + q)^n                      np         npq
(D)   Poisson, X ∼ Po(µ)        P(X = x) = µ^x e^{−µ} / x!,  x ∈ N                 e^{µ(e^{iω} − 1)}                     µ          µ
(C)   Uniform, X ∼ U(a, b)      f(x) = 1/(b − a),  x ∈ [a, b]                      (e^{iωb} − e^{iωa}) / ((b − a) iω)    (a + b)/2  (b − a)²/12
(C)   Gaussian, X ∼ N(µ, σ)     f(x) = (1/(√(2π) σ)) e^{−(1/2)((x − µ)/σ)²},  x ∈ R   e^{iωµ − ω²σ²/2}                   µ          σ²
(C)   Exponential, X ∼ Exp(λ)   f(x) = λ e^{−λx},  x ∈ R+                          1 / (1 − iω/λ)                        1/λ        1/λ²
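A short numpy/scipy sketch can confirm the covariance and correlation identities and the E[X], Var(X) column of the distribution table; the particular coefficients and sample sizes are arbitrary choices for the demonstration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Empirical check of Cov(X, Y) = E[XY] - E[X]E[Y] and rho = Cov / (sigma_X sigma_Y).
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)
cov = (x * y).mean() - x.mean() * y.mean()
rho = cov / (x.std() * y.std())
print(round(cov, 2), round(rho, 2))    # cov ≈ 0.50, rho ≈ 0.45

# Mean and variance of two table entries, via scipy.stats:
print(stats.binom.stats(n=10, p=0.3))  # mean = np = 3.0, variance = npq = 2.1
print(stats.expon.stats(scale=1/2))    # mean = 1/lambda = 0.5, variance = 1/lambda^2 = 0.25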
Parameter estimation

• Random sample – A random sample is a collection of n random variables X1, ..., Xn that are independent and identically distributed with X.

• Estimator – An estimator θ̂ is a function of the data that is used to infer the value of an unknown parameter θ in a statistical model.

• Bias – The bias of an estimator θ̂ is defined as being the difference between the expected value of the distribution of θ̂ and the true value, i.e.:
Bias(θ̂) = E[θ̂] − θ
Remark: an estimator is said to be unbiased when we have E[θ̂] = θ.

• Sample mean and variance – The sample mean and the sample variance of a random sample are used to estimate the true mean µ and the true variance σ² of a distribution, are noted X̄ and s² respectively, and are such that:
X̄ = (1/n) Σ_{i=1}^n Xi   and   s² = σ̂² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²

• Central Limit Theorem – Let us have a random sample X1, ..., Xn following a given distribution with mean µ and variance σ². Then, as n → +∞:
X̄ ∼ N(µ, σ/√n)
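The Central Limit Theorem statement above is easy to watch happen in simulation; a minimal sketch, with an Exponential(λ = 2) population chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(42)

# Sample means of n = 100 draws from Expo(lambda = 2): mu = 1/2, sigma = 1/2.
lam, n, reps = 2.0, 100, 50_000
means = rng.exponential(scale=1 / lam, size=(reps, n)).mean(axis=1)

print(round(means.mean(), 3))   # ≈ 0.5  (the true mean mu)
print(round(means.std(), 3))    # ≈ 0.05 (sigma / sqrt(n)), as the CLT predicts
# A histogram of `means` is close to N(mu, sigma/sqrt(n)) even though the population is skewed.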
Probability Cheatsheet v2.0
Compiled by William Chen (http://wzchen.com) and Joe Blitzstein, with contributions from Sebastian Chiu, Yuan Jiang, Yuqi Hou, and Jessy Hwang. Material based on Joe Blitzstein's (@stat110) lectures (http://stat110.net) and Blitzstein/Hwang's Introduction to Probability textbook (http://bit.ly/introprobability). Licensed under CC BY-NC-SA 4.0. Please share comments, suggestions, and errors at http://github.com/wzchen/probability_cheatsheet.
Last Updated September 4, 2015

Counting

Multiplication Rule
Let's say we have a compound experiment (an experiment with multiple components). If the 1st component has n1 possible outcomes, the 2nd component has n2 possible outcomes, ..., and the rth component has nr possible outcomes, then overall there are n1·n2·...·nr possibilities for the whole experiment.

Sampling Table
The sampling table gives the number of possible samples of size k out of a population of size n, under various assumptions about how the sample is collected.

                       Order Matters       Order Doesn't Matter
With Replacement       n^k                 C(n + k − 1, k)
Without Replacement    n! / (n − k)!       C(n, k)

Naive Definition of Probability
If all outcomes are equally likely, the probability of an event A happening is:
P_naive(A) = (number of outcomes favorable to A) / (number of outcomes)

Thinking Conditionally

Independence
Independent Events: A and B are independent if knowing whether A occurred gives no information about whether B occurred. More formally, A and B (which have nonzero probability) are independent if and only if one of the following equivalent statements holds:
P(A ∩ B) = P(A)P(B)
P(A|B) = P(A)
P(B|A) = P(B)

Conditional Independence: A and B are conditionally independent given C if P(A ∩ B|C) = P(A|C)P(B|C). Conditional independence does not imply independence, and independence does not imply conditional independence.

Unions, Intersections, and Complements
De Morgan's Laws: A useful identity that can make calculating probabilities of unions easier by relating them to intersections, and vice versa. Analogous results hold with more than two sets.
(A ∪ B)^c = A^c ∩ B^c
(A ∩ B)^c = A^c ∪ B^c

Joint, Marginal, and Conditional
Joint Probability: P(A ∩ B) or P(A, B) – Probability of A and B.
Marginal (Unconditional) Probability: P(A) – Probability of A.
Conditional Probability: P(A|B) = P(A, B)/P(B) – Probability of A, given that B occurred.
Conditional Probability is Probability: P(A|B) is a probability function for any fixed B. Any theorem that holds for probability also holds for conditional probability.

Probability of an Intersection or Union
Intersections via Conditioning:
P(A, B) = P(A)P(B|A)
P(A, B, C) = P(A)P(B|A)P(C|A, B)
Unions via Inclusion-Exclusion:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)

Simpson's Paradox
It is possible to have
P(A | B, C) < P(A | B^c, C) and P(A | B, C^c) < P(A | B^c, C^c)
yet also P(A | B) > P(A | B^c).
(Figure: the classic Dr. Hibbert vs. Dr. Nick illustration, with heart surgery and band-aid removal.)

Law of Total Probability (LOTP)
Let B1, B2, B3, ..., Bn be a partition of the sample space (i.e., they are disjoint and their union is the entire sample space).
P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + · · · + P(A|Bn)P(Bn)
P(A) = P(A ∩ B1) + P(A ∩ B2) + · · · + P(A ∩ Bn)
For LOTP with extra conditioning, just add in another event C!
P(A|C) = P(A|B1, C)P(B1|C) + · · · + P(A|Bn, C)P(Bn|C)
P(A|C) = P(A ∩ B1|C) + P(A ∩ B2|C) + · · · + P(A ∩ Bn|C)
Special case of LOTP with B and B^c as partition:
P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)
P(A) = P(A ∩ B) + P(A ∩ B^c)

Bayes' Rule
Bayes' Rule, and with extra conditioning (just add in C!):
P(A|B) = P(B|A)P(A) / P(B)
P(A|B, C) = P(B|A, C)P(A|C) / P(B|C)
We can also write
P(A|B, C) = P(A, B, C) / P(B, C) = P(B, C|A)P(A) / P(B, C)
Odds Form of Bayes' Rule:
P(A|B) / P(A^c|B) = [P(B|A) / P(B|A^c)] · [P(A) / P(A^c)]
The posterior odds of A are the likelihood ratio times the prior odds.

Random Variables and their Distributions

PMF, CDF, and Independence
Probability Mass Function (PMF): Gives the probability that a discrete random variable takes on the value x.
p_X(x) = P(X = x)
The PMF satisfies
p_X(x) ≥ 0   and   Σ_x p_X(x) = 1
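The four entries of the sampling table can be verified by brute force with the standard library; a small sketch with an arbitrary n = 5, k = 3:

from itertools import combinations, combinations_with_replacement, permutations, product
from math import comb, factorial

n, k = 5, 3
pool = range(n)

# Enumerate every sample under each scheme and compare with the table's formulas.
assert len(list(product(pool, repeat=k))) == n ** k                              # ordered, with replacement
assert len(list(permutations(pool, k))) == factorial(n) // factorial(n - k)      # ordered, without replacement
assert len(list(combinations_with_replacement(pool, k))) == comb(n + k - 1, k)   # unordered, with replacement
assert len(list(combinations(pool, k))) == comb(n, k)                            # unordered, without replacement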
Cumulative Distribution Function (CDF): Gives the probability that a random variable is less than or equal to x.
F_X(x) = P(X ≤ x)
The CDF is an increasing, right-continuous function with F_X(x) → 0 as x → −∞ and F_X(x) → 1 as x → ∞.

Independence: Intuitively, two random variables are independent if knowing the value of one gives no information about the other. Discrete r.v.s X and Y are independent if for all values of x and y
P(X = x, Y = y) = P(X = x)P(Y = y)

Expected Value and Indicators

Expected Value and Linearity
Expected Value (a.k.a. mean, expectation, or average) is a weighted average of the possible outcomes of our random variable. Mathematically, if x1, x2, x3, ... are all of the distinct possible values that X can take, the expected value of X is
E(X) = Σ_i xi P(X = xi)
(Illustration: for paired data, (1/n)Σ xi + (1/n)Σ yi = (1/n)Σ (xi + yi); in the same way E(X) + E(Y) = E(X + Y), whether or not X and Y are independent.)
Linearity: For any r.v.s X and Y, and constants a, b, c,
E(aX + bY + c) = aE(X) + bE(Y) + c
Same distribution implies same mean: If X and Y have the same distribution, then E(X) = E(Y) and, more generally, E(g(X)) = E(g(Y)).
Conditional Expected Value is defined like expectation, only conditioned on any event A:
E(X|A) = Σ_x x P(X = x|A)

Indicator Random Variables
An Indicator Random Variable is a random variable that takes on the value 1 or 0. It is always an indicator of some event: if the event occurs, the indicator is 1; otherwise it is 0. They are useful for many problems about counting how many events of some kind occur. Write
I_A = 1 if A occurs, 0 if A does not occur.
Note that I_A² = I_A, I_A I_B = I_{A∩B}, and I_{A∪B} = I_A + I_B − I_A I_B.
Distribution: I_A ∼ Bern(p) where p = P(A).
Fundamental Bridge: The expectation of the indicator for event A is the probability of event A: E(I_A) = P(A).

Variance and Standard Deviation
Var(X) = E(X − E(X))² = E(X²) − (E(X))²
SD(X) = √Var(X)

Continuous RVs, LOTUS, UoU

Continuous Random Variables (CRVs)
What's the probability that a CRV is in an interval? Take the difference in CDF values (or use the PDF as described later).
P(a ≤ X ≤ b) = P(X ≤ b) − P(X ≤ a) = F_X(b) − F_X(a)
For X ∼ N(µ, σ²), this becomes
P(a ≤ X ≤ b) = Φ((b − µ)/σ) − Φ((a − µ)/σ)
What is the Probability Density Function (PDF)? The PDF f is the derivative of the CDF F:
F′(x) = f(x)
A PDF is nonnegative and integrates to 1. By the fundamental theorem of calculus, to get from PDF back to CDF we can integrate:
F(x) = ∫_{−∞}^{x} f(t) dt
To find the probability that a CRV takes on a value in an interval, integrate the PDF over that interval:
F(b) − F(a) = ∫_a^b f(x) dx
How do I find the expected value of a CRV? Analogous to the discrete case, where you sum x times the PMF, for CRVs you integrate x times the PDF:
E(X) = ∫_{−∞}^{∞} x f(x) dx

LOTUS
Expected value of a function of an r.v.: The expected value of X is defined this way:
E(X) = Σ_x x P(X = x)   (for discrete X)
E(X) = ∫_{−∞}^{∞} x f(x) dx   (for continuous X)
The Law of the Unconscious Statistician (LOTUS) states that you can find the expected value of a function of a random variable, g(X), in a similar way, by replacing the x in front of the PMF/PDF by g(x) but still working with the PMF/PDF of X:
E(g(X)) = Σ_x g(x) P(X = x)   (for discrete X)
E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx   (for continuous X)
What's a function of a random variable? A function of a random variable is also a random variable. For example, if X is the number of bikes you see in an hour, then g(X) = 2X is the number of bike wheels you see in that hour and h(X) = C(X, 2) = X(X − 1)/2 is the number of pairs of bikes such that you see both of those bikes in that hour.
What's the point? You don't need to know the PMF/PDF of g(X) to find its expected value. All you need is the PMF/PDF of X.

Universality of Uniform (UoU)
When you plug any CRV into its own CDF, you get a Uniform(0,1) random variable. When you plug a Uniform(0,1) r.v. into an inverse CDF, you get an r.v. with that CDF. For example, let's say that a random variable X has CDF
F(x) = 1 − e^{−x}, for x > 0
By UoU, if we plug X into this function then we get a uniformly distributed random variable:
F(X) = 1 − e^{−X} ∼ Unif(0, 1)
Similarly, if U ∼ Unif(0, 1) then F^{−1}(U) has CDF F. The key point is that for any continuous random variable X, we can transform it into a Uniform random variable and back by using its CDF.

Moments and MGFs

Moments
Moments describe the shape of a distribution. Let X have mean µ and standard deviation σ, and Z = (X − µ)/σ be the standardized version of X. The kth moment of X is µk = E(X^k) and the kth standardized moment of X is mk = E(Z^k). The mean, variance, skewness, and kurtosis are important summaries of the shape of a distribution.
Mean E(X) = µ1
Variance Var(X) = µ2 − µ1²
Skewness Skew(X) = m3
Kurtosis Kurt(X) = m4 − 3
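Universality of the Uniform and LOTUS are both two-line experiments in numpy; the sketch below uses the same CDF F(x) = 1 − e^{−x} as the example above, plus an arbitrary Bin(3, 0.5) PMF for the LOTUS check.

import numpy as np

rng = np.random.default_rng(1)

# UoU: plug U ~ Unif(0,1) into the inverse CDF F^{-1}(u) = -ln(1 - u) to get Expo(1).
u = rng.uniform(size=100_000)
x = -np.log(1 - u)
print(round(x.mean(), 2), round(x.var(), 2))   # both ≈ 1.0, as for Expo(1)

# LOTUS: E[g(X)] for g(x) = x^2 using only the PMF of X ~ Bin(3, 0.5).
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}
e_g = sum(k ** 2 * p for k, p in pmf.items())
print(e_g)                                      # 3.0 = Var(X) + E(X)^2 = 0.75 + 2.25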
Moment Generating Functions
MGF: For any random variable X, the function
M_X(t) = E(e^{tX})
is the moment generating function (MGF) of X, if it exists for all t in some open interval containing 0. The variable t could just as well have been called u or v. It's a bookkeeping device that lets us work with the function M_X rather than the sequence of moments.
Why is it called the Moment Generating Function? Because the kth derivative of the moment generating function, evaluated at 0, is the kth moment of X:
µk = E(X^k) = M_X^{(k)}(0)
This is true by Taylor expansion of e^{tX} since
M_X(t) = E(e^{tX}) = Σ_{k=0}^{∞} E(X^k) t^k / k! = Σ_{k=0}^{∞} µk t^k / k!
MGF of linear functions: If we have Y = aX + b, then
M_Y(t) = E(e^{t(aX+b)}) = e^{bt} E(e^{(at)X}) = e^{bt} M_X(at)
Uniqueness: If it exists, the MGF uniquely determines the distribution. This means that for any two random variables X and Y, they are distributed the same (their PMFs/PDFs are equal) if and only if their MGFs are equal.
Summing Independent RVs by Multiplying MGFs: If X and Y are independent, then
M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX})E(e^{tY}) = M_X(t) · M_Y(t)
The MGF of the sum of two random variables is the product of the MGFs of those two random variables.

Joint PDFs and CDFs

Joint Distributions
The joint CDF of X and Y is
F(x, y) = P(X ≤ x, Y ≤ y)
In the discrete case, X and Y have a joint PMF
p_{X,Y}(x, y) = P(X = x, Y = y)
In the continuous case, they have a joint PDF
f_{X,Y}(x, y) = ∂²F_{X,Y}(x, y) / ∂x∂y
The joint PMF/PDF must be nonnegative and sum/integrate to 1.

Conditional Distributions
Conditioning and Bayes' rule for discrete r.v.s:
P(Y = y|X = x) = P(X = x, Y = y) / P(X = x) = P(X = x|Y = y)P(Y = y) / P(X = x)
Conditioning and Bayes' rule for continuous r.v.s:
f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x) = f_{X|Y}(x|y) f_Y(y) / f_X(x)
Hybrid Bayes' rule:
f_X(x|A) = P(A|X = x) f_X(x) / P(A)

Marginal Distributions
To find the distribution of one (or more) random variables from a joint PMF/PDF, sum/integrate over the unwanted random variables.
Marginal PMF from joint PMF: P(X = x) = Σ_y P(X = x, Y = y)
Marginal PDF from joint PDF: f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy

Independence of Random Variables
Random variables X and Y are independent if and only if any of the following conditions holds:
• Joint CDF is the product of the marginal CDFs
• Joint PMF/PDF is the product of the marginal PMFs/PDFs
• Conditional distribution of Y given X is the marginal distribution of Y
Write X ⊥ Y to denote that X and Y are independent.

Multivariate LOTUS
LOTUS in more than one dimension is analogous to the 1D LOTUS. For discrete random variables:
E(g(X, Y)) = Σ_x Σ_y g(x, y) P(X = x, Y = y)
For continuous random variables:
E(g(X, Y)) = ∫∫ g(x, y) f_{X,Y}(x, y) dx dy

Covariance and Transformations

Covariance and Correlation
Covariance is the analog of variance for two random variables.
Cov(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y)
Note that Cov(X, X) = E(X²) − (E(X))² = Var(X).
Correlation is a standardized version of covariance that is always between −1 and 1.
Corr(X, Y) = Cov(X, Y) / √(Var(X)Var(Y))
Covariance and Independence: If two random variables are independent, then they are uncorrelated. The converse is not necessarily true (e.g., consider X ∼ N(0, 1) and Y = X²).
X ⊥ Y → Cov(X, Y) = 0 → E(XY) = E(X)E(Y)
Covariance and Variance: The variance of a sum can be found by
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Var(X1 + X2 + · · · + Xn) = Σ_{i=1}^n Var(Xi) + 2 Σ_{i<j} Cov(Xi, Xj)
If X and Y are independent then they have covariance 0, so
X ⊥ Y ⟹ Var(X + Y) = Var(X) + Var(Y)
If X1, X2, ..., Xn are identically distributed and have the same covariance relationships (often by symmetry), then
Var(X1 + X2 + · · · + Xn) = nVar(X1) + 2 C(n, 2) Cov(X1, X2)
Covariance Properties: For random variables W, X, Y, Z and constants a, b:
Cov(X, Y) = Cov(Y, X)
Cov(X + a, Y + b) = Cov(X, Y)
Cov(aX, bY) = ab Cov(X, Y)
Cov(W + X, Y + Z) = Cov(W, Y) + Cov(W, Z) + Cov(X, Y) + Cov(X, Z)
Correlation is location-invariant and scale-invariant: For any constants a, b, c, d with a and c nonzero,
Corr(aX + b, cY + d) = Corr(X, Y)

Transformations
One Variable Transformations: Let's say that we have a random variable X with PDF f_X(x), but we are also interested in some function of X. We call this function Y = g(X). Also let y = g(x). If g is differentiable and strictly increasing (or strictly decreasing), then the PDF of Y is
f_Y(y) = f_X(x) |dx/dy| = f_X(g^{−1}(y)) |d/dy g^{−1}(y)|
The derivative of the inverse transformation is called the Jacobian.
Two Variable Transformations: Similarly, let's say we know the joint PDF of U and V but are also interested in the random vector (X, Y) defined by (X, Y) = g(U, V). Let
∂(u, v)/∂(x, y) = [ ∂u/∂x  ∂u/∂y ; ∂v/∂x  ∂v/∂y ]
be the Jacobian matrix. If the entries in this matrix exist and are continuous, and the determinant of the matrix is never 0, then
f_{X,Y}(x, y) = f_{U,V}(u, v) |∂(u, v)/∂(x, y)|
The inner bars tell us to take the matrix's determinant, and the outer bars tell us to take the absolute value. In a 2 × 2 matrix, |det [a b; c d]| = |ad − bc|.

Convolutions
Convolution Integral: If you want to find the PDF of the sum of two independent CRVs X and Y, you can do the following integral:
f_{X+Y}(t) = ∫_{−∞}^{∞} f_X(x) f_Y(t − x) dx
Example: Let X, Y ∼ N(0, 1) be i.i.d. Then for each fixed t,
f_{X+Y}(t) = ∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2} · (1/√(2π)) e^{−(t−x)²/2} dx
By completing the square and using the fact that a Normal PDF integrates to 1, this works out to f_{X+Y}(t) being the N(0, 2) PDF.
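The convolution example above (sum of two i.i.d. standard Normals is N(0, 2)) can be checked empirically; a minimal sketch with an arbitrary sample size and a few arbitrary evaluation points:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

x = rng.normal(size=200_000)
y = rng.normal(size=200_000)
s = x + y

print(round(s.var(), 2))   # ≈ 2.0, consistent with N(0, 2)
# Compare the empirical CDF of X + Y with the N(0, sqrt(2)) CDF at a few points:
for t in (-1.0, 0.0, 1.5):
    print(round((s <= t).mean(), 3), round(stats.norm.cdf(t, scale=np.sqrt(2)), 3))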
Poisson Process
Definition: We have a Poisson process of rate λ arrivals per unit time if the following conditions hold:
1. The number of arrivals in a time interval of length t is Pois(λt).
2. Numbers of arrivals in disjoint time intervals are independent.
For example, the numbers of arrivals in the time intervals [0, 5], (5, 12), and [13, 23) are independent with Pois(5λ), Pois(7λ), Pois(10λ) distributions, respectively.
Count-Time Duality: Consider a Poisson process of emails arriving in an inbox at rate λ emails per hour. Let T_n be the time of arrival of the nth email (relative to some starting time 0) and N_t be the number of emails that arrive in [0, t]. Let's find the distribution of T_1. The event T_1 > t, the event that you have to wait more than t hours to get the first email, is the same as the event N_t = 0, which is the event that there are no emails in the first t hours. So
P(T_1 > t) = P(N_t = 0) = e^{−λt} → P(T_1 ≤ t) = 1 − e^{−λt}
Thus we have T_1 ∼ Expo(λ). By the memoryless property and similar reasoning, the interarrival times between emails are i.i.d. Expo(λ), i.e., the differences T_n − T_{n−1} are i.i.d. Expo(λ).

Order Statistics
Definition: Let's say you have n i.i.d. r.v.s X1, X2, ..., Xn. If you arrange them from smallest to largest, the ith element in that list is the ith order statistic, denoted X_(i). So X_(1) is the smallest in the list and X_(n) is the largest in the list.
Note that the order statistics are dependent, e.g., learning X_(4) = 42 gives us the information that X_(1), X_(2), X_(3) are ≤ 42 and X_(5), X_(6), ..., X_(n) are ≥ 42.
Distribution: Taking n i.i.d. random variables X1, X2, ..., Xn with CDF F(x) and PDF f(x), the CDF and PDF of X_(i) are:
F_{X_(i)}(x) = P(X_(i) ≤ x) = Σ_{k=i}^{n} C(n, k) F(x)^k (1 − F(x))^{n−k}
f_{X_(i)}(x) = n C(n − 1, i − 1) F(x)^{i−1} (1 − F(x))^{n−i} f(x)
Uniform Order Statistics: The jth order statistic of i.i.d. U1, ..., Un ∼ Unif(0, 1) is U_(j) ∼ Beta(j, n − j + 1).

Conditional Expectation
Conditioning on an Event: We can find E(Y|A), the expected value of Y given that event A occurred. A very important case is when A is the event X = x. Note that E(Y|A) is a number. For example:
• The expected value of a fair die roll, given that it is prime, is (1/3)·2 + (1/3)·3 + (1/3)·5 = 10/3.
• Let Y be the number of successes in 10 independent Bernoulli trials with probability p of success. Let A be the event that the first 3 trials are all successes. Then E(Y|A) = 3 + 7p, since the number of successes among the last 7 trials is Bin(7, p).
• Let T ∼ Expo(1/10) be how long you have to wait until the shuttle comes. Given that you have already waited t minutes, the expected additional waiting time is 10 more minutes, by the memoryless property. That is, E(T|T > t) = t + 10.
Formulas:
Discrete Y: E(Y) = Σ_y y P(Y = y),   E(Y|A) = Σ_y y P(Y = y|A)
Continuous Y: E(Y) = ∫_{−∞}^{∞} y f_Y(y) dy,   E(Y|A) = ∫_{−∞}^{∞} y f(y|A) dy
Conditioning on a Random Variable: We can also find E(Y|X), the expected value of Y given the random variable X. This is a function of the random variable X. It is not a number except in certain special cases such as if X ⊥ Y. To find E(Y|X), find E(Y|X = x) and then plug in X for x. For example:
• If E(Y|X = x) = x³ + 5x, then E(Y|X) = X³ + 5X.
• Let Y be the number of successes in 10 independent Bernoulli trials with probability p of success and X be the number of successes among the first 3 trials. Then E(Y|X) = X + 7p.
• Let X ∼ N(0, 1) and Y = X². Then E(Y|X = x) = x², since if we know X = x then we know Y = x². And E(X|Y = y) = 0, since if we know Y = y then we know X = ±√y, with equal probabilities (by symmetry). So E(Y|X) = X², E(X|Y) = 0.
Properties of Conditional Expectation:
1. E(Y|X) = E(Y) if X ⊥ Y
2. E(h(X)W|X) = h(X)E(W|X) (taking out what's known). In particular, E(h(X)|X) = h(X).
3. E(E(Y|X)) = E(Y) (Adam's Law, a.k.a. Law of Total Expectation)
Adam's Law (a.k.a. Law of Total Expectation) can also be written in a way that looks analogous to LOTP. For any events A1, A2, ..., An that partition the sample space,
E(Y) = E(Y|A1)P(A1) + · · · + E(Y|An)P(An)
For the special case where the partition is A, A^c, this says
E(Y) = E(Y|A)P(A) + E(Y|A^c)P(A^c)
Eve's Law (a.k.a. Law of Total Variance):
Var(Y) = E(Var(Y|X)) + Var(E(Y|X))

MVN, LLN, CLT

Law of Large Numbers (LLN)
Let X1, X2, X3, ... be i.i.d. with mean µ. The sample mean is
X̄_n = (X1 + X2 + X3 + · · · + Xn) / n
The Law of Large Numbers states that as n → ∞, X̄_n → µ with probability 1. For example, in flips of a coin with probability p of Heads, let Xj be the indicator of the jth flip being Heads. Then LLN says the proportion of Heads converges to p (with probability 1).

Central Limit Theorem (CLT)
Approximation using CLT: We use ∼̇ to denote "is approximately distributed". We can use the Central Limit Theorem to approximate the distribution of a random variable Y = X1 + X2 + · · · + Xn that is a sum of n i.i.d. random variables Xi. Let E(Y) = µ_Y and Var(Y) = σ²_Y. The CLT says
Y ∼̇ N(µ_Y, σ²_Y)
If the Xi are i.i.d. with mean µ_X and variance σ²_X, then µ_Y = nµ_X and σ²_Y = nσ²_X. For the sample mean X̄_n, the CLT says
X̄_n = (X1 + X2 + · · · + Xn)/n ∼̇ N(µ_X, σ²_X/n)
Asymptotic Distributions using CLT: We use →^D to denote "converges in distribution to" as n → ∞. The CLT says that if we standardize the sum X1 + · · · + Xn then the distribution of the sum converges to N(0, 1) as n → ∞:
(1/(σ√n)) (X1 + · · · + Xn − nµ_X) →^D N(0, 1)
In other words, the CDF of the left-hand side goes to the standard Normal CDF, Φ. In terms of the sample mean, the CLT says
√n (X̄_n − µ_X) / σ_X →^D N(0, 1)

Markov Chains
Definition: A Markov chain is a random walk in a state space, which we will assume is finite, say {1, 2, ..., M}. We let X_t denote which element of the state space the walk is visiting at time t. The Markov chain is the sequence of random variables tracking where the walk is at all points in time, X_0, X_1, X_2, .... By definition, a Markov chain must satisfy the Markov property, which says that if you want to predict where the chain will be at a future time, if we know the present state then the entire past history is irrelevant. Given the present, the past and future are conditionally independent. In symbols,
P(X_{n+1} = j | X_0 = i_0, X_1 = i_1, ..., X_n = i) = P(X_{n+1} = j | X_n = i)
(Figure: a 5-state transition diagram used as the running example.)
State Properties
A state is either recurrent or transient.
• If you start at a recurrent state, then you will always return back to that state at some point in the future. ♪ You can check-out any time you like, but you can never leave. ♪
• Otherwise you are at a transient state. There is some positive probability that once you leave you will never return. ♪ You don't have to go home, but you can't stay here. ♪
A state is either periodic or aperiodic.
• If you start at a periodic state of period k, then the GCD of the possible numbers of steps it would take to return back is k > 1.
• Otherwise you are at an aperiodic state. The GCD of the possible numbers of steps it would take to return back is 1.
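Count-time duality is easy to see in simulation: build each Poisson-process path from i.i.d. Expo(λ) interarrival times and look at the first arrival time and the count in [0, 1]. The rate and number of repetitions below are arbitrary choices for the sketch.

import numpy as np

rng = np.random.default_rng(3)

lam, reps = 2.0, 2_000   # rate: 2 arrivals per unit time

first_arrivals, counts_in_unit = [], []
for _ in range(reps):
    # One path: cumulative sums of i.i.d. Expo(lam) gaps (300 gaps is plenty to cover t = 1).
    gaps = rng.exponential(scale=1 / lam, size=300)
    times = np.cumsum(gaps)
    first_arrivals.append(times[0])                       # T_1
    counts_in_unit.append(np.searchsorted(times, 1.0))    # N_1 = number of arrivals in [0, 1]

print(round(np.mean(first_arrivals), 2))   # ≈ 0.5 = 1/lam, consistent with T_1 ~ Expo(lam)
print(round(np.mean(counts_in_unit), 2))   # ≈ 2.0 = lam·t, consistent with N_1 ~ Pois(lam)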
Transition Matrix Continuous Distributions Gamma Distribution
Let the state space be {1, 2, . . . , M }. The transition matrix Q is the Gamma(3, 1) Gamma(3, 0.5)

M ⇥ M matrix where element qij is the probability that the chain goes Uniform Distribution

0.2
from state i to state j in one step:

0.10
Let us say that U is distributed Unif(a, b). We know the following:

PDF
PDF
Properties of the Uniform For a Uniform distribution, the

0.1
qij = P (Xn+1 = j|Xn = i)

0.05
probability of a draw from any interval within the support is
To find the probability that the chain goes from state i to state j in proportional to the length of the interval. See Universality of Uniform

0.0
0.00
and Order Statistics for other properties. 0 5 10 15 20 0 5 10 15 20
exactly m steps, take the (i, j) element of Qm . x x

Example William throws darts really badly, so his darts are uniform Gamma(10, 1) Gamma(5, 0.5)
(m)

0.10
qij = P (Xn+m = j|Xn = i) over the whole room because they’re equally likely to appear anywhere.
William’s darts have a Uniform distribution on the surface of the

0.10
~, i.e.,
If X0 is distributed according to the row vector PMF p room. The Uniform is the only distribution where the probability of
0.05

PDF
PDF

pj = P (X0 = j), then the PMF of Xn is p ~Qn . hitting in any specific region is proportional to the length/area/volume

0.05
of that region, and where the density of occurrence in any one specific
spot is constant throughout the whole support.
Chain Properties

0.00
0.00

0 5 10 15 20 0 5 10 15 20
x x
A chain is irreducible if you can get from anywhere to anywhere. If a chain (on a finite state space) is irreducible, then all of its states are recurrent. A chain is periodic if any of its states are periodic, and is aperiodic if none of its states are periodic. In an irreducible chain, all states have the same period.

A chain is reversible with respect to ~s if si qij = sj qji for all i, j. Examples of reversible chains include any chain with qij = qji, with ~s = (1/M, 1/M, ..., 1/M), and random walk on an undirected network.

Stationary Distribution
Let the vector ~s = (s1, s2, ..., sM) be a PMF (written as a row vector). We call ~s the stationary distribution for the chain if ~sQ = ~s. As a consequence, if Xt has the stationary distribution, then all future Xt+1, Xt+2, ... also have the stationary distribution.

For irreducible, aperiodic chains, the stationary distribution exists, is unique, and si is the long-run probability of the chain being at state i. The expected number of steps to return to i starting from i is 1/si.

To find the stationary distribution, you can solve the matrix equation (Q^T - I) ~s^T = 0. The stationary distribution is uniform if the columns of Q sum to 1.

Reversibility Condition Implies Stationarity If you have a PMF ~s and a Markov chain with transition matrix Q, then si qij = sj qji for all states i, j implies that ~s is stationary.

Random Walk on an Undirected Network
If you have a collection of nodes, pairs of which can be connected by undirected edges, and a Markov chain is run by going from the current node to a uniformly random node that is connected to it by an edge, then this is a random walk on an undirected network. The stationary distribution of this chain is proportional to the degree sequence (the sequence of degrees, where the degree of a node is how many edges are attached to it). For example, the stationary distribution of the random walk on the network shown above is proportional to (3, 3, 2, 4, 2), so it's (3/14, 3/14, 2/14, 4/14, 2/14).

Normal Distribution
Let us say that X is distributed N(mu, sigma^2). We know the following:
Central Limit Theorem The Normal distribution is ubiquitous because of the Central Limit Theorem, which states that the sample mean of i.i.d. r.v.s will approach a Normal distribution as the sample size grows, regardless of the initial distribution.
Location-Scale Transformation Every time we shift a Normal r.v. (by adding a constant) or rescale a Normal (by multiplying by a constant), we change it to another Normal r.v. For any Normal X ~ N(mu, sigma^2), we can transform it to the standard N(0, 1) by the following transformation:
Z = (X - mu)/sigma ~ N(0, 1)
Standard Normal The Standard Normal, Z ~ N(0, 1), has mean 0 and variance 1. Its CDF is denoted by Phi.

Exponential Distribution
Let us say that X is distributed Expo(lambda). We know the following:
Story You're sitting on an open meadow right before the break of dawn, wishing that airplanes in the night sky were shooting stars, because you could really use a wish right now. You know that shooting stars come on average every 15 minutes, but a shooting star is not "due" to come just because you've waited so long. Your waiting time is memoryless; the additional time until the next shooting star comes does not depend on how long you've waited already.
Example The waiting time until the next shooting star is distributed Expo(4) hours. Here lambda = 4 is the rate parameter, since shooting stars arrive at a rate of 1 per 1/4 hour on average. The expected time until the next shooting star is 1/lambda = 1/4 hour.
Expos as a rescaled Expo(1)
Y ~ Expo(lambda) -> X = lambda*Y ~ Expo(1)
Memorylessness The Exponential Distribution is the only continuous memoryless distribution. The memoryless property says that for X ~ Expo(lambda) and any positive numbers s and t,
P(X > s + t | X > s) = P(X > t)
Equivalently,
X - a | (X > a) ~ Expo(lambda)
For example, a product with an Expo(lambda) lifetime is always "as good as new" (it doesn't experience wear and tear). Given that the product has survived a years, the additional time that it will last is still Expo(lambda).
Min of Expos If we have independent Xi ~ Expo(lambda_i), then min(X1, ..., Xk) ~ Expo(lambda_1 + lambda_2 + ... + lambda_k).
Max of Expos If we have i.i.d. Xi ~ Expo(lambda), then max(X1, ..., Xk) has the same distribution as Y1 + Y2 + ... + Yk, where Yj ~ Expo(j*lambda) and the Yj are independent.

Gamma Distribution
Let us say that X is distributed Gamma(a, lambda). We know the following:
Story You sit waiting for shooting stars, where the waiting time for a star is distributed Expo(lambda). You want to see n shooting stars before you go home. The total waiting time for the nth shooting star is Gamma(n, lambda).
Example You are at a bank, and there are 3 people ahead of you. The serving time for each person is Exponential with mean 2 minutes. Only one person at a time can be served. The distribution of your waiting time until it's your turn to be served is Gamma(3, 1/2).

Beta Distribution
[PDF plots of Beta(0.5, 0.5), Beta(2, 1), Beta(2, 8), and Beta(5, 5) on x in (0, 1).]
Conjugate Prior of the Binomial In the Bayesian approach to statistics, parameters are viewed as random variables, to reflect our uncertainty. The prior for a parameter is its distribution before observing data. The posterior is the distribution for the parameter after observing data. Beta is the conjugate prior of the Binomial because if you have a Beta-distributed prior on p in a Binomial, then the posterior distribution on p given the Binomial data is also Beta-distributed. Consider the following two-level model:
X | p ~ Bin(n, p)
p ~ Beta(a, b)
Then after observing X = x, we get the posterior distribution
p | (X = x) ~ Beta(a + x, b + n - x)
Order statistics of the Uniform See Order Statistics.
Beta-Gamma relationship If X ~ Gamma(a, lambda), Y ~ Gamma(b, lambda), with X independent of Y, then
- X/(X + Y) ~ Beta(a, b)
- X + Y is independent of X/(X + Y)
This is known as the bank-post office result.
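As a quick illustration of the conjugate-prior recipe above, here is a minimal Python sketch using scipy.stats. The prior parameters and data (a Beta(2, 2) prior, n = 20 trials, x = 13 successes) are made-up values for illustration; the point is only that the posterior is Beta(a + x, b + n - x).

from scipy import stats

a, b = 2, 2          # Beta(a, b) prior on p (assumed values for illustration)
n, x = 20, 13        # observed X = x successes out of n Binomial trials

# Conjugacy: the posterior on p is Beta(a + x, b + n - x)
posterior = stats.beta(a + x, b + n - x)

print(posterior.mean())          # posterior mean of p: (a + x) / (a + b + n)
print(posterior.interval(0.95))  # central 95% posterior interval for p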
88
Discrete Distributions

Distributions for four sampling schemes
                         Replace                 No Replace
Fixed # trials (n)       Binomial (Bern if n=1)  HGeom
Draw until r successes   NBin (Geom if r=1)      NHGeom

Bernoulli Distribution
The Bernoulli distribution is the simplest case of the Binomial distribution, where we only have one trial (n = 1). Let us say that X is distributed Bern(p). We know the following:
Story A trial is performed with probability p of "success", and X is the indicator of success: 1 means success, 0 means failure.
Example Let X be the indicator of Heads for a fair coin toss. Then X ~ Bern(1/2). Also, 1 - X ~ Bern(1/2) is the indicator of Tails.

Binomial Distribution
[PMF plot of Bin(10, 1/2) over x = 0, 1, ..., 10.]
Let us say that X is distributed Bin(n, p). We know the following:
Story X is the number of "successes" that we will achieve in n independent trials, where each trial is either a success or a failure, each with the same probability p of success. We can also write X as a sum of multiple independent Bern(p) random variables. Let X ~ Bin(n, p) and Xj ~ Bern(p), where all of the Bernoullis are independent. Then
X = X1 + X2 + X3 + ... + Xn
Example If Jeremy Lin makes 10 free throws and each one independently has a 3/4 chance of getting in, then the number of free throws he makes is distributed Bin(10, 3/4).
Properties Let X ~ Bin(n, p), Y ~ Bin(m, p) with X independent of Y.
- Redefine success n - X ~ Bin(n, 1 - p)
- Sum X + Y ~ Bin(n + m, p)
- Conditional X | (X + Y = r) ~ HGeom(n, m, r)
- Binomial-Poisson Relationship Bin(n, p) is approximately Pois(np) if p is small.
- Binomial-Normal Relationship Bin(n, p) is approximately N(np, np(1 - p)) if n is large and p is not near 0 or 1.

Geometric Distribution
Let us say that X is distributed Geom(p). We know the following:
Story X is the number of "failures" that we will achieve before we achieve our first success. Our successes have probability p.
Example If each pokeball we throw has probability 1/10 to catch Mew, the number of failed pokeballs will be distributed Geom(1/10).

First Success Distribution
Equivalent to the Geometric distribution, except that it includes the first success in the count. This is 1 more than the number of failures. If X ~ FS(p) then E(X) = 1/p.

Negative Binomial Distribution
Let us say that X is distributed NBin(r, p). We know the following:
Story X is the number of "failures" that we will have before we achieve our rth success. Our successes have probability p.
Example Thundershock has 60% accuracy and can faint a wild Raticate in 3 hits. The number of misses before Pikachu faints Raticate with Thundershock is distributed NBin(3, 0.6).

Hypergeometric Distribution
Let us say that X is distributed HGeom(w, b, n). We know the following:
Story In a population of w desired objects and b undesired objects, X is the number of "successes" we will have in a draw of n objects, without replacement. The draw of n objects is assumed to be a simple random sample (all sets of n objects are equally likely).
Examples Here are some HGeom examples.
- Let's say that we have only b Weedles (failure) and w Pikachus (success) in Viridian Forest. We encounter n Pokemon in the forest, and X is the number of Pikachus in our encounters.
- The number of Aces in a 5 card hand.
- You have w white balls and b black balls, and you draw n balls. You will draw X white balls.
- You have w white balls and b black balls, and you draw n balls without replacement. The number of white balls in your sample is HGeom(w, b, n); the number of black balls is HGeom(b, w, n).
- Capture-recapture A forest has N elk, you capture n of them, tag them, and release them. Then you recapture a new sample of size m. How many tagged elk are now in the new sample? HGeom(n, N - n, m)

Poisson Distribution
Let us say that X is distributed Pois(lambda). We know the following:
Story There are rare events (low probability events) that occur many different ways (high possibilities of occurrences) at an average rate of lambda occurrences per unit space or time. The number of events that occur in that unit of space or time is X.
Example A certain busy intersection has an average of 2 accidents per month. Since an accident is a low probability event that can happen many different ways, it is reasonable to model the number of accidents in a month at that intersection as Pois(2). Then the number of accidents that happen in two months at that intersection is distributed Pois(4).
Properties Let X ~ Pois(lambda_1) and Y ~ Pois(lambda_2), with X independent of Y.
1. Sum X + Y ~ Pois(lambda_1 + lambda_2)
2. Conditional X | (X + Y = n) ~ Bin(n, lambda_1/(lambda_1 + lambda_2))
3. Chicken-egg If there are Z ~ Pois(lambda) items and we randomly and independently "accept" each item with probability p, then the number of accepted items Z1 ~ Pois(lambda*p), the number of rejected items Z2 ~ Pois(lambda*(1 - p)), and Z1 is independent of Z2.

Chi-Square Distribution
Let us say that X is distributed chi^2_n. We know the following:
Story A Chi-Square(n) is the sum of the squares of n independent standard Normal r.v.s.
Properties and Representations
X is distributed as Z1^2 + Z2^2 + ... + Zn^2 for i.i.d. Zi ~ N(0, 1)
X ~ Gamma(n/2, 1/2)

Multivariate Distributions

Multinomial Distribution
Let us say that the vector X = (X1, X2, X3, ..., Xk) ~ Mult_k(n, p), where p = (p1, p2, ..., pk).
Story We have n items, which can fall into any one of the k buckets independently with the probabilities p = (p1, p2, ..., pk).
Example Let us assume that every year, 100 students in the Harry Potter Universe are randomly and independently sorted into one of four houses with equal probability. The number of people in each of the houses is distributed Mult_4(100, p), where p = (0.25, 0.25, 0.25, 0.25). Note that X1 + X2 + ... + X4 = 100, and they are dependent.
Joint PMF For n = n1 + n2 + ... + nk,
P(X = n) = (n!/(n1! n2! ... nk!)) p1^n1 p2^n2 ... pk^nk
Marginal PMF, Lumping, and Conditionals Marginally, Xi ~ Bin(n, pi) since we can define "success" to mean category i. If you lump together multiple categories in a Multinomial, then it is still Multinomial. For example, Xi + Xj ~ Bin(n, pi + pj) for i != j since we can define "success" to mean being in category i or j. Similarly, if k = 6 and we lump categories 1-2 and lump categories 3-5, then
(X1 + X2, X3 + X4 + X5, X6) ~ Mult_3(n, (p1 + p2, p3 + p4 + p5, p6))
Conditioning on some Xj also still gives a Multinomial:
X1, ..., X_{k-1} | Xk = nk ~ Mult_{k-1}(n - nk, (p1/(1 - pk), ..., p_{k-1}/(1 - pk)))
Variances and Covariances We have Xi ~ Bin(n, pi) marginally, so Var(Xi) = n pi (1 - pi). Also, Cov(Xi, Xj) = -n pi pj for i != j.

Multivariate Uniform Distribution
See the univariate Uniform for stories and examples. For the 2D Uniform on some region, probability is proportional to area. Every point in the support has equal density, of value 1/(area of region). For the 3D Uniform, probability is proportional to volume.

Multivariate Normal (MVN) Distribution
A vector X = (X1, X2, ..., Xk) is Multivariate Normal if every linear combination is Normally distributed, i.e., t1 X1 + t2 X2 + ... + tk Xk is Normal for any constants t1, t2, ..., tk. The parameters of the Multivariate Normal are the mean vector mu = (mu_1, mu_2, ..., mu_k) and the covariance matrix where the (i, j) entry is Cov(Xi, Xj).
Properties The Multivariate Normal has the following properties.
- Any subvector is also MVN.
- If any two elements within an MVN are uncorrelated, then they are independent.
- The joint PDF of a Bivariate Normal (X, Y) with N(0, 1) marginal distributions and correlation rho in (-1, 1) is
f_{X,Y}(x, y) = (1/(2*pi*tau)) exp(-(x^2 + y^2 - 2*rho*x*y)/(2*tau^2)),
with tau = sqrt(1 - rho^2).
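The distribution properties above are easy to sanity-check by simulation. A minimal numpy sketch (the rates, acceptance probability, and sample size are arbitrary illustration values) for the Poisson sum and chicken-egg properties:

import numpy as np

rng = np.random.default_rng(0)
lam1, lam2, size = 3.0, 5.0, 100_000

# Sum property: X + Y ~ Pois(lam1 + lam2)
x = rng.poisson(lam1, size)
y = rng.poisson(lam2, size)
print((x + y).mean(), (x + y).var())    # both should be close to lam1 + lam2 = 8

# Chicken-egg: accept each of Z ~ Pois(lam) items independently with probability p
lam, p = 10.0, 0.3
z = rng.poisson(lam, size)
accepted = rng.binomial(z, p)           # Z1 | Z ~ Bin(Z, p)
print(accepted.mean(), accepted.var())  # both should be close to lam * p = 3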
89
Distribution Properties

Important CDFs
Standard Normal  Phi
Exponential(lambda)  F(x) = 1 - e^{-lambda*x}, for x in (0, infinity)
Uniform(0, 1)  F(x) = x, for x in (0, 1)

Convolutions of Random Variables
A convolution of n random variables is simply their sum. For the following results, let X and Y be independent.
1. X ~ Pois(lambda_1), Y ~ Pois(lambda_2) -> X + Y ~ Pois(lambda_1 + lambda_2)
2. X ~ Bin(n1, p), Y ~ Bin(n2, p) -> X + Y ~ Bin(n1 + n2, p). Bin(n, p) can be thought of as a sum of i.i.d. Bern(p) r.v.s.
3. X ~ Gamma(a1, lambda), Y ~ Gamma(a2, lambda) -> X + Y ~ Gamma(a1 + a2, lambda). Gamma(n, lambda) with n an integer can be thought of as a sum of i.i.d. Expo(lambda) r.v.s.
4. X ~ NBin(r1, p), Y ~ NBin(r2, p) -> X + Y ~ NBin(r1 + r2, p). NBin(r, p) can be thought of as a sum of i.i.d. Geom(p) r.v.s.
5. X ~ N(mu_1, sigma_1^2), Y ~ N(mu_2, sigma_2^2) -> X + Y ~ N(mu_1 + mu_2, sigma_1^2 + sigma_2^2)

Special Cases of Distributions
1. Bin(1, p) ~ Bern(p)
2. Beta(1, 1) ~ Unif(0, 1)
3. Gamma(1, lambda) ~ Expo(lambda)
4. chi^2_n ~ Gamma(n/2, 1/2)
5. NBin(1, p) ~ Geom(p)

Inequalities
1. Cauchy-Schwarz  |E(XY)| <= sqrt(E(X^2) E(Y^2))
2. Markov  P(X >= a) <= E|X|/a for a > 0
3. Chebyshev  P(|X - mu| >= a) <= sigma^2/a^2 for E(X) = mu, Var(X) = sigma^2
4. Jensen  E(g(X)) >= g(E(X)) for g convex; reverse if g is concave

Formulas

Geometric Series
1 + r + r^2 + ... + r^{n-1} = sum_{k=0}^{n-1} r^k = (1 - r^n)/(1 - r)
1 + r + r^2 + ... = 1/(1 - r) if |r| < 1

Exponential Function (e^x)
e^x = sum_{n=0}^{infinity} x^n/n! = 1 + x + x^2/2! + x^3/3! + ... = lim_{n->infinity} (1 + x/n)^n

Gamma and Beta Integrals
You can sometimes solve complicated-looking integrals by pattern-matching to a gamma or beta integral:
int_0^infinity x^{t-1} e^{-x} dx = Gamma(t)          int_0^1 x^{a-1} (1 - x)^{b-1} dx = Gamma(a)Gamma(b)/Gamma(a + b)
Also, Gamma(a + 1) = a*Gamma(a), and Gamma(n) = (n - 1)! if n is a positive integer.

Euler's Approximation for Harmonic Sums
1 + 1/2 + 1/3 + ... + 1/n ~ log n + 0.577...

Stirling's Approximation for Factorials
n! ~ sqrt(2*pi*n) (n/e)^n

Miscellaneous Definitions
Medians and Quantiles Let X have CDF F. Then X has median m if F(m) >= 0.5 and P(X >= m) >= 0.5. For X continuous, m satisfies F(m) = 1/2. In general, the a-th quantile of X is min{x : F(x) >= a}; the median is the case a = 1/2.
log Statisticians generally use log to refer to natural log (i.e., base e).
i.i.d. r.v.s Independent, identically-distributed random variables.

Example Problems
Contributions from Sebastian Chiu

Calculating Probability
A textbook has n typos, which are randomly scattered amongst its n pages, independently. You pick a random page. What is the probability that it has no typos? Answer: There is a 1 - 1/n probability that any specific typo isn't on your page, and thus a (1 - 1/n)^n probability that there are no typos on your page. For n large, this is approximately e^{-1} = 1/e.

Linearity and Indicators (1)
In a group of n people, what is the expected number of distinct birthdays (month and day)? What is the expected number of birthday matches? Answer: Let X be the number of distinct birthdays and Ij be the indicator for the jth day being represented.
E(Ij) = 1 - P(no one born on day j) = 1 - (364/365)^n
By linearity, E(X) = 365 (1 - (364/365)^n). Now let Y be the number of birthday matches and Ji be the indicator that the ith pair of people have the same birthday. The probability that any two specific people share a birthday is 1/365, so E(Y) = (n choose 2)/365.

Linearity and Indicators (2)
This problem is commonly known as the hat-matching problem. There are n people at a party, each with a hat. At the end of the party, they each leave with a random hat. What is the expected number of people who leave with the right hat? Answer: Each hat has a 1/n chance of going to the right person. By linearity, the average number of hats that go to their owners is n(1/n) = 1.

Linearity and First Success
This problem is commonly known as the coupon collector problem. There are n coupon types. At each draw, you get a uniformly random coupon type. What is the expected number of coupons needed until you have a complete set? Answer: Let N be the number of coupons needed; we want E(N). Let N = N1 + ... + Nn, where N1 is the draws to get our first new coupon, N2 is the additional draws needed to draw our second new coupon and so on. By the story of the First Success, N2 ~ FS((n - 1)/n) (after collecting the first coupon type, there's (n - 1)/n chance you'll get something new). Similarly, N3 ~ FS((n - 2)/n), and Nj ~ FS((n - j + 1)/n). By linearity,
E(N) = E(N1) + ... + E(Nn) = n/n + n/(n - 1) + ... + n/1 = n * sum_{j=1}^{n} 1/j
This is approximately n(log(n) + 0.577) by Euler's approximation. A simulation check is sketched below, after this block of examples.

Orderings of i.i.d. random variables
I call 2 UberX's and 3 Lyfts at the same time. If the times it takes for the rides to reach me are i.i.d., what is the probability that all the Lyfts will arrive first? Answer: Since the arrival times of the five cars are i.i.d., all 5! orderings of the arrivals are equally likely. There are 3!2! orderings that involve the Lyfts arriving first, so the probability that the Lyfts arrive first is 3!2!/5! = 1/10. Alternatively, there are (5 choose 3) ways to choose 3 of the 5 slots for the Lyfts to occupy, where each of the choices is equally likely. One of these choices has all 3 of the Lyfts arriving first, so the probability is 1/(5 choose 3) = 1/10.

Expectation of Negative Hypergeometric
What is the expected number of cards that you draw before you pick your first Ace in a shuffled deck (not counting the Ace)? Answer: Consider a non-Ace. Denote this to be card j. Let Ij be the indicator that card j will be drawn before the first Ace. Note that Ij = 1 says that j is before all 4 of the Aces in the deck. The probability that this occurs is 1/5 by symmetry. Let X be the number of cards drawn before the first Ace. Then X = I1 + I2 + ... + I48, where each indicator corresponds to one of the 48 non-Aces. Thus,
E(X) = E(I1) + E(I2) + ... + E(I48) = 48/5 = 9.6.

Minimum and Maximum of RVs
What is the CDF of the maximum of n independent Unif(0,1) random variables? Answer: Note that for r.v.s X1, X2, ..., Xn,
P(min(X1, X2, ..., Xn) >= a) = P(X1 >= a, X2 >= a, ..., Xn >= a)
Similarly,
P(max(X1, X2, ..., Xn) <= a) = P(X1 <= a, X2 <= a, ..., Xn <= a)
We will use this principle to find the CDF of U_(n), where U_(n) = max(U1, U2, ..., Un) and the Ui ~ Unif(0, 1) are i.i.d.
P(max(U1, U2, ..., Un) <= a) = P(U1 <= a, U2 <= a, ..., Un <= a)
= P(U1 <= a) P(U2 <= a) ... P(Un <= a)
= a^n
for 0 < a < 1 (and the CDF is 0 for a <= 0 and 1 for a >= 1).

Pattern-matching with e^x Taylor series
For X ~ Pois(lambda), find E(1/(X + 1)). Answer: By LOTUS,
E(1/(X + 1)) = sum_{k=0}^{infinity} (1/(k + 1)) e^{-lambda} lambda^k / k! = (e^{-lambda}/lambda) sum_{k=0}^{infinity} lambda^{k+1}/(k + 1)! = (e^{-lambda}/lambda)(e^{lambda} - 1)
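Two of the worked answers above are easy to check numerically. A small Python sketch (the number of coupon types, number of trials, and the cutoff a are arbitrary choices) comparing the coupon-collector expectation with n * sum 1/j, and the CDF of the maximum of uniforms with a^n:

import numpy as np

rng = np.random.default_rng(1)

# Coupon collector: E(N) = n * (1 + 1/2 + ... + 1/n)
n, trials = 20, 20_000
draws = np.empty(trials)
for t in range(trials):
    seen, count = set(), 0
    while len(seen) < n:
        seen.add(int(rng.integers(n)))
        count += 1
    draws[t] = count
print(draws.mean(), n * sum(1 / j for j in range(1, n + 1)))   # should be close

# Max of uniforms: P(max(U1, ..., Um) <= a) = a**m
m, a = 5, 0.8
u_max = rng.random((100_000, m)).max(axis=1)
print((u_max <= a).mean(), a**m)                               # should be close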
90
Adam's Law and Eve's Law

William really likes speedsolving Rubik's Cubes. But he's pretty bad at it, so sometimes he fails. On any given day, William will attempt N ~ Geom(s) Rubik's Cubes. Suppose each time, he has probability p of solving the cube, independently. Let T be the number of Rubik's Cubes he solves during a day. Find the mean and variance of T.
Answer: Note that T | N ~ Bin(N, p). So by Adam's Law,
E(T) = E(E(T | N)) = E(Np) = p(1 - s)/s
Similarly, by Eve's Law, we have that
Var(T) = E(Var(T | N)) + Var(E(T | N)) = E(Np(1 - p)) + Var(Np)
= p(1 - p)(1 - s)/s + p^2(1 - s)/s^2 = p(1 - s)(p + s(1 - p))/s^2

MGF - Finding Moments

Find E(X^3) for X ~ Expo(lambda) using the MGF of X. Answer: The MGF of an Expo(lambda) is M(t) = lambda/(lambda - t). To get the third moment, we can take the third derivative of the MGF and evaluate at t = 0:
E(X^3) = 6/lambda^3
But a much nicer way to use the MGF here is via pattern recognition: note that M(t) looks like it came from a geometric series:
1/(1 - t/lambda) = sum_{n=0}^{infinity} (t/lambda)^n = sum_{n=0}^{infinity} (n!/lambda^n) (t^n/n!)
The coefficient of t^n/n! here is the nth moment of X, so we have E(X^n) = n!/lambda^n for all nonnegative integers n.

Markov chains (1)

Suppose Xn is a two-state Markov chain with transition matrix (rows and columns indexed by states 0 and 1)
Q = [ 1 - alpha    alpha  ]
    [ beta         1 - beta ]
Find the stationary distribution ~s = (s0, s1) of Xn by solving ~sQ = ~s, and show that the chain is reversible with respect to ~s. Answer: The equation ~sQ = ~s says that
s0 = s0(1 - alpha) + s1*beta   and   s1 = s0*alpha + s1(1 - beta)
By solving this system of linear equations, we have
~s = (beta/(alpha + beta), alpha/(alpha + beta))
To show that the chain is reversible with respect to ~s, we must show si qij = sj qji for all i, j. This is done if we can show s0 q01 = s1 q10. And indeed,
s0 q01 = alpha*beta/(alpha + beta) = s1 q10

Markov chains (2)

William and Sebastian play a modified game of Settlers of Catan, where every turn they randomly move the robber (which starts on the center tile) to one of the adjacent hexagons.
(a) Is this Markov chain irreducible? Is it aperiodic? Answer: Yes to both. The Markov chain is irreducible because it can get from anywhere to anywhere else. The Markov chain is aperiodic because the robber can return back to a square in 2, 3, 4, 5, ... moves, and the GCD of those numbers is 1.
(b) What is the stationary distribution of this Markov chain? Answer: Since this is a random walk on an undirected graph, the stationary distribution is proportional to the degree sequence. The degree for the corner pieces is 3, the degree for the edge pieces is 4, and the degree for the center pieces is 6. To normalize this degree sequence, we divide by its sum. The sum of the degrees is 6(3) + 6(4) + 7(6) = 84. Thus the stationary probability of being on a corner is 3/84 = 1/28, on an edge is 4/84 = 1/21, and in the center is 6/84 = 1/14.
(c) What fraction of the time will the robber be in the center tile in this game, in the long run? Answer: By the above, 1/14.
(d) What is the expected amount of moves it will take for the robber to return to the center tile? Answer: Since this chain is irreducible and aperiodic, to get the expected time to return we can just invert the stationary probability. Thus on average it will take 14 turns for the robber to return to the center tile.
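The two-state answer above can be verified numerically. A minimal numpy sketch (alpha and beta are arbitrary example values) that recovers the stationary distribution as the left eigenvector of Q with eigenvalue 1 and compares it to the closed form:

import numpy as np

alpha, beta = 0.3, 0.1                       # example transition probabilities
Q = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

# Stationary distribution: left eigenvector of Q for eigenvalue 1, normalized to sum to 1
vals, vecs = np.linalg.eig(Q.T)
s = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
s = s / s.sum()

print(s)                                                  # numerical stationary distribution
print(beta / (alpha + beta), alpha / (alpha + beta))      # closed form from the answer above
print(np.allclose(s @ Q, s))                              # check that sQ = s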
Problem-Solving Strategies

Contributions from Jessy Hwang, Yuan Jiang, Yuqi Hou
1. Getting started. Start by defining relevant events and random variables. ("Let A be the event that I pick the fair coin"; "Let X be the number of successes.") Clear notation is important for clear thinking! Then decide what it is that you're supposed to be finding, in terms of your notation ("I want to find P(X = 3 | A)"). Think about what type of object your answer should be (a number? a random variable? a PMF? a PDF?) and what it should be in terms of. Try simple and extreme cases. To make an abstract experiment more concrete, try drawing a picture or making up numbers that could have happened. Pattern recognition: does the structure of the problem resemble something we've seen before?
2. Calculating probability of an event. Use counting principles if the naive definition of probability applies. Is the probability of the complement easier to find? Look for symmetries. Look for something to condition on, then apply Bayes' Rule or the Law of Total Probability.
3. Finding the distribution of a random variable. First make sure you need the full distribution, not just the mean (see next item). Check the support of the random variable: what values can it take on? Use this to rule out distributions that don't fit. Is there a story for one of the named distributions that fits the problem at hand? Can you write the random variable as a function of an r.v. with a known distribution, say Y = g(X)?
4. Calculating expectation. If it has a named distribution, check out the table of distributions. If it's a function of an r.v. with a named distribution, try LOTUS. If it's a count of something, try breaking it up into indicator r.v.s. If you can condition on something natural, consider using Adam's law.
5. Calculating variance. Consider independence, named distributions, and LOTUS. If it's a count of something, break it up into a sum of indicator r.v.s. If it's a sum, use properties of covariance. If you can condition on something natural, consider using Eve's Law.
6. Calculating E(X^2). Do you already know E(X) or Var(X)? Recall that Var(X) = E(X^2) - (E(X))^2. Otherwise try LOTUS.
7. Calculating covariance. Use the properties of covariance. If you're trying to find the covariance between two components of a Multinomial distribution, Xi, Xj, then the covariance is -n pi pj for i != j.
8. Symmetry. If X1, ..., Xn are i.i.d., consider using symmetry.
9. Calculating probabilities of orderings. Remember that all n! orderings of i.i.d. continuous random variables X1, ..., Xn are equally likely.
10. Determining independence. There are several equivalent definitions. Think about simple and extreme cases to see if you can find a counterexample.
11. Do a painful integral. If your integral looks painful, see if you can write your integral in terms of a known PDF (like Gamma or Beta), and use the fact that PDFs integrate to 1.
12. Before moving on. Check some simple and extreme cases, check whether the answer seems plausible, check for biohazards.

Biohazards

Contributions from Jessy Hwang
1. Don't misuse the naive definition of probability. When answering "What is the probability that in a group of 3 people, no two have the same birth month?", it is not correct to treat the people as indistinguishable balls being placed into 12 boxes, since that assumes the list of birth months {January, January, January} is just as likely as the list {January, April, June}, even though the latter is six times more likely.
2. Don't confuse unconditional, conditional, and joint probabilities. In applying P(A|B) = P(B|A)P(A)/P(B), it is not correct to say "P(B) = 1 because we know B happened"; P(B) is the prior probability of B. Don't confuse P(A|B) with P(A, B).
3. Don't assume independence without justification. In the matching problem, the probability that card 1 is a match and card 2 is a match is not 1/n^2. Binomial and Hypergeometric are often confused; the trials are independent in the Binomial story and dependent in the Hypergeometric story.
4. Don't forget to do sanity checks. Probabilities must be between 0 and 1. Variances must be >= 0. Supports must make sense. PMFs must sum to 1. PDFs must integrate to 1.
5. Don't confuse random variables, numbers, and events. Let X be an r.v. Then g(X) is an r.v. for any function g. In particular, X^2, |X|, F(X), and I_{X>3} are r.v.s. P(X^2 < X | X >= 0), E(X), Var(X), and g(E(X)) are numbers. X = 2 and F(X) >= -1 are events. It does not make sense to write int_{-1}^{1} F(X) dx, because F(X) is a random variable. It does not make sense to write P(X), because X is not an event.
91
6. Don't confuse a random variable with its distribution. To get the PDF of X^2, you can't just square the PDF of X. The right way is to use transformations. To get the PDF of X + Y, you can't just add the PDF of X and the PDF of Y. The right way is to compute the convolution.
7. Don't pull non-linear functions out of expectations. E(g(X)) does not equal g(E(X)) in general. The St. Petersburg paradox is an extreme example. See also Jensen's inequality. The right way to find E(g(X)) is with LOTUS.

Recommended Resources
- Introduction to Probability Book (http://bit.ly/introprobability)
- Stat 110 Online (http://stat110.net)
- Stat 110 Quora Blog (https://stat110.quora.com/)
- Quora Probability FAQ (http://bit.ly/probabilityfaq)
- R Studio (https://www.rstudio.com)
- LaTeX File (github.com/wzchen/probability cheatsheet)

Please share this cheatsheet with friends!
http://wzchen.com/probability-cheatsheet

Distributions in R

Command                 What it does
help(distributions)     shows documentation on distributions
dbinom(k,n,p)           PMF P(X = k) for X ~ Bin(n, p)
pbinom(x,n,p)           CDF P(X <= x) for X ~ Bin(n, p)
qbinom(a,n,p)           a-th quantile for X ~ Bin(n, p)
rbinom(r,n,p)           vector of r i.i.d. Bin(n, p) r.v.s
dgeom(k,p)              PMF P(X = k) for X ~ Geom(p)
dhyper(k,w,b,n)         PMF P(X = k) for X ~ HGeom(w, b, n)
dnbinom(k,r,p)          PMF P(X = k) for X ~ NBin(r, p)
dpois(k,r)              PMF P(X = k) for X ~ Pois(r)
dbeta(x,a,b)            PDF f(x) for X ~ Beta(a, b)
dchisq(x,n)             PDF f(x) for X ~ chi^2_n
dexp(x,b)               PDF f(x) for X ~ Expo(b)
dgamma(x,a,r)           PDF f(x) for X ~ Gamma(a, r)
dlnorm(x,m,s)           PDF f(x) for X ~ LN(m, s^2)
dnorm(x,m,s)            PDF f(x) for X ~ N(m, s^2)
dt(x,n)                 PDF f(x) for X ~ t_n
dunif(x,a,b)            PDF f(x) for X ~ Unif(a, b)

The table above gives R commands for working with various named
distributions. Commands analogous to pbinom, qbinom, and rbinom
work for the other distributions in the table. For example, pnorm,
qnorm, and rnorm can be used to get the CDF, quantiles, and random
generation for the Normal. For the Multinomial, dmultinom can be used
for calculating the joint PMF and rmultinom can be used for generating
random vectors. For the Multivariate Normal, after installing and
loading the mvtnorm package dmvnorm can be used for calculating the
joint PDF and rmvnorm can be used for generating random vectors.
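Since the rest of this compilation leans on Python, here is a rough scipy.stats analogue of a few of the R commands above. This is only a sketch; the argument order differs from R, and scipy parametrizes the Exponential by scale = 1/rate.

from scipy import stats

stats.binom.pmf(3, n=10, p=0.5)        # dbinom(3, 10, 0.5): P(X = 3) for X ~ Bin(10, 0.5)
stats.binom.cdf(3, n=10, p=0.5)        # pbinom(3, 10, 0.5): P(X <= 3)
stats.binom.ppf(0.9, n=10, p=0.5)      # qbinom(0.9, 10, 0.5): 0.9 quantile
stats.binom.rvs(n=10, p=0.5, size=5)   # rbinom(5, 10, 0.5): 5 i.i.d. draws
stats.norm.pdf(0.0, loc=0, scale=1)    # dnorm(0, 0, 1): PDF of N(0, 1) at 0
stats.expon.pdf(1.0, scale=1 / 2)      # dexp(1, 2): Expo with rate 2, i.e. scale 1/2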
92
Table of Distributions

Distribution                     PMF/PDF and Support                                                        Expected Value        Variance                                MGF

Bernoulli, Bern(p)               P(X = 1) = p, P(X = 0) = q = 1 - p                                         p                     pq                                      q + p*e^t
Binomial, Bin(n, p)              P(X = k) = C(n, k) p^k q^{n-k},  k in {0, 1, ..., n}                       np                    npq                                     (q + p*e^t)^n
Geometric, Geom(p)               P(X = k) = q^k p,  k in {0, 1, 2, ...}                                     q/p                   q/p^2                                   p/(1 - q*e^t), q*e^t < 1
Negative Binomial, NBin(r, p)    P(X = n) = C(r + n - 1, r - 1) p^r q^n,  n in {0, 1, 2, ...}               rq/p                  rq/p^2                                  (p/(1 - q*e^t))^r, q*e^t < 1
Hypergeometric, HGeom(w, b, n)   P(X = k) = C(w, k) C(b, n - k) / C(w + b, n),  k in {0, 1, ..., n}         mu = nw/(b + w)       ((w + b - n)/(w + b - 1)) n (mu/n)(1 - mu/n)   messy
Poisson, Pois(lambda)            P(X = k) = e^{-lambda} lambda^k / k!,  k in {0, 1, 2, ...}                 lambda                lambda                                  e^{lambda(e^t - 1)}
Uniform, Unif(a, b)              f(x) = 1/(b - a),  x in (a, b)                                             (a + b)/2             (b - a)^2/12                            (e^{tb} - e^{ta})/(t(b - a))
Normal, N(mu, sigma^2)           f(x) = (1/(sigma*sqrt(2*pi))) e^{-(x - mu)^2/(2 sigma^2)},  x in (-inf, inf)   mu               sigma^2                                 e^{t*mu + sigma^2 t^2/2}
Exponential, Expo(lambda)        f(x) = lambda e^{-lambda x},  x in (0, inf)                                1/lambda              1/lambda^2                              lambda/(lambda - t), t < lambda
Gamma, Gamma(a, lambda)          f(x) = (1/Gamma(a)) (lambda x)^a e^{-lambda x} (1/x),  x in (0, inf)       a/lambda              a/lambda^2                              (lambda/(lambda - t))^a, t < lambda
Beta, Beta(a, b)                 f(x) = (Gamma(a + b)/(Gamma(a)Gamma(b))) x^{a-1}(1 - x)^{b-1},  x in (0, 1)   mu = a/(a + b)     mu(1 - mu)/(a + b + 1)                  messy
Log-Normal, LN(mu, sigma^2)      f(x) = (1/(x sigma sqrt(2*pi))) e^{-(log x - mu)^2/(2 sigma^2)},  x in (0, inf)   theta = e^{mu + sigma^2/2}   theta^2 (e^{sigma^2} - 1)   doesn't exist
Chi-Square, chi^2_n              f(x) = (1/(2^{n/2} Gamma(n/2))) x^{n/2 - 1} e^{-x/2},  x in (0, inf)       n                     2n                                      (1 - 2t)^{-n/2}, t < 1/2
Student-t, t_n                   f(x) = (Gamma((n+1)/2)/(sqrt(n*pi) Gamma(n/2))) (1 + x^2/n)^{-(n+1)/2},  x in (-inf, inf)   0 if n > 1   n/(n - 2) if n > 2          doesn't exist
93
CS 229 - Machine Learning          https://stanford.edu/~shervine

VIP Refresher: Linear Algebra and Calculus
Afshine Amidi and Shervine Amidi
October 6, 2018

General notations

Vector - We note x in R^n a vector with n entries, where x_i in R is the ith entry:
x = (x_1, x_2, ..., x_n)^T in R^n

Matrix - We note A in R^{m x n} a matrix with m rows and n columns, where A_{i,j} in R is the entry located in the ith row and jth column:
A = [ A_{1,1} ... A_{1,n} ; ... ; A_{m,1} ... A_{m,n} ] in R^{m x n}
Remark: the vector x defined above can be viewed as an n x 1 matrix and is more particularly called a column-vector.

Identity matrix - The identity matrix I in R^{n x n} is a square matrix with ones on its diagonal and zeros everywhere else:
I = diag(1, ..., 1)
Remark: for all matrices A in R^{n x n}, we have A I = I A = A.

Diagonal matrix - A diagonal matrix D in R^{n x n} is a square matrix with nonzero values on its diagonal and zeros everywhere else:
D = diag(d_1, ..., d_n)
Remark: we also note D as diag(d_1, ..., d_n).

Matrix operations

Vector-vector multiplication - There are two types of vector-vector products:
- inner product: for x, y in R^n, we have:
  x^T y = sum_{i=1}^{n} x_i y_i in R
- outer product: for x in R^m, y in R^n, we have:
  x y^T = [ x_1 y_1 ... x_1 y_n ; ... ; x_m y_1 ... x_m y_n ] in R^{m x n}

Matrix-vector multiplication - The product of matrix A in R^{m x n} and vector x in R^n is a vector of size R^m, such that:
Ax = (a_{r,1}^T x, ..., a_{r,m}^T x)^T = sum_{i=1}^{n} a_{c,i} x_i in R^m
where a_{r,i}^T are the vector rows and a_{c,j} are the vector columns of A, and x_i are the entries of x.

Matrix-matrix multiplication - The product of matrices A in R^{m x n} and B in R^{n x p} is a matrix of size R^{m x p}, such that:
AB = [ a_{r,1}^T b_{c,1} ... a_{r,1}^T b_{c,p} ; ... ; a_{r,m}^T b_{c,1} ... a_{r,m}^T b_{c,p} ] = sum_{i=1}^{n} a_{c,i} b_{r,i}^T in R^{m x p}
where a_{r,i}^T, b_{r,i}^T are the vector rows and a_{c,j}, b_{c,j} are the vector columns of A and B respectively.

Transpose - The transpose of a matrix A in R^{m x n}, noted A^T, is such that its entries are flipped:
for all i, j,  (A^T)_{i,j} = A_{j,i}
Remark: for matrices A, B, we have (AB)^T = B^T A^T.

Inverse - The inverse of an invertible square matrix A is noted A^{-1} and is the only matrix such that:
A A^{-1} = A^{-1} A = I
Remark: not all square matrices are invertible. Also, for matrices A, B, we have (AB)^{-1} = B^{-1} A^{-1}.

Trace - The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:
tr(A) = sum_{i=1}^{n} A_{i,i}
Remark: for matrices A, B, we have tr(A^T) = tr(A) and tr(AB) = tr(BA).

Determinant - The determinant of a square matrix A in R^{n x n}, noted |A| or det(A), is expressed recursively in terms of A_{\i,\j}, which is the matrix A without its ith row and jth column, as follows:
det(A) = |A| = sum_{j=1}^{n} (-1)^{i+j} A_{i,j} |A_{\i,\j}|
Remark: A is invertible if and only if |A| != 0. Also, |AB| = |A||B| and |A^T| = |A|.
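A minimal numpy sketch of the operations defined above (the vectors and the matrix are arbitrary example values):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 1.0]])

inner = x @ y              # inner product x^T y (a scalar)
outer = np.outer(x, y)     # outer product x y^T (a 3x3 matrix)
Ax = A @ x                 # matrix-vector product
AAt = A @ A.T              # matrix-matrix product (here A A^T)

print(np.allclose(AAt.T, AAt))             # (AB)^T = B^T A^T, so A A^T is symmetric
print(np.trace(A), np.linalg.det(A))       # trace and determinant
A_inv = np.linalg.inv(A)                   # inverse (requires det(A) != 0)
print(np.allclose(A @ A_inv, np.eye(3)))   # A A^{-1} = I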


94
Matrix properties

Symmetric decomposition - A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:
A = (A + A^T)/2 + (A - A^T)/2
    (symmetric)   (antisymmetric)

Norm - A norm is a function N : V -> [0, +inf[ where V is a vector space, and such that for all x, y in V, we have:
- N(x + y) <= N(x) + N(y)
- N(ax) = |a| N(x) for a scalar
- if N(x) = 0, then x = 0
For x in V, the most commonly used norms are summed up in the table below:

Norm            Notation      Definition                          Use case
Manhattan, L1   ||x||_1       sum_{i=1}^{n} |x_i|                 LASSO regularization
Euclidean, L2   ||x||_2       sqrt(sum_{i=1}^{n} x_i^2)           Ridge regularization
p-norm, Lp      ||x||_p       (sum_{i=1}^{n} x_i^p)^{1/p}         Hoelder inequality
Infinity, Linf  ||x||_inf     max_i |x_i|                         Uniform convergence

Linear dependence - A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.
Remark: if no vector can be written this way, then the vectors are said to be linearly independent.

Matrix rank - The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.

Positive semi-definite matrix - A matrix A in R^{n x n} is positive semi-definite (PSD), noted A >= 0, if we have:
A = A^T  and  for all x in R^n, x^T A x >= 0
Remark: similarly, a matrix A is said to be positive definite, and is noted A > 0, if it is a PSD matrix which satisfies x^T A x > 0 for all non-zero vectors x.

Eigenvalue, eigenvector - Given a matrix A in R^{n x n}, lambda is said to be an eigenvalue of A if there exists a vector z in R^n \ {0}, called an eigenvector, such that:
Az = lambda z

Spectral theorem - Let A in R^{n x n}. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U in R^{n x n}. By noting Lambda = diag(lambda_1, ..., lambda_n), we have:
there exists Lambda diagonal such that A = U Lambda U^T

Singular-value decomposition - For a given matrix A of dimensions m x n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U (m x m unitary), Sigma (m x n diagonal) and V (n x n unitary) matrices, such that:
A = U Sigma V^T

Matrix calculus

Gradient - Let f : R^{m x n} -> R be a function and A in R^{m x n} be a matrix. The gradient of f with respect to A is an m x n matrix, noted grad_A f(A), such that:
(grad_A f(A))_{i,j} = d f(A) / d A_{i,j}
Remark: the gradient of f is only defined when f is a function that returns a scalar.

Hessian - Let f : R^n -> R be a function and x in R^n be a vector. The hessian of f with respect to x is an n x n symmetric matrix, noted hess_x f(x), such that:
(hess_x f(x))_{i,j} = d^2 f(x) / (d x_i d x_j)
Remark: the hessian of f is only defined when f is a function that returns a scalar.

Gradient operations - For matrices A, B, C, the following gradient properties are worth having in mind:
grad_A tr(AB) = B^T
grad_{A^T} f(A) = (grad_A f(A))^T
grad_A tr(A B A^T C) = CAB + C^T A B^T
grad_A |A| = |A| (A^{-1})^T
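A minimal numpy sketch of the decompositions and quantities above (S and A are arbitrary example matrices; S is symmetric so the spectral theorem applies):

import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # symmetric example matrix

eigvals, U = np.linalg.eigh(S)        # eigendecomposition for symmetric matrices
print(np.allclose(U @ np.diag(eigvals) @ U.T, S))   # A = U Lambda U^T

A = np.array([[3.0, 1.0, 2.0],
              [0.0, 2.0, 1.0]])
U2, sigma, Vt = np.linalg.svd(A)      # singular-value decomposition A = U Sigma V^T
print(sigma)

print(np.linalg.matrix_rank(A))       # rank of A
print(np.linalg.norm(A[0], 1),        # L1 norm of a vector
      np.linalg.norm(A[0], 2),        # L2 (Euclidean) norm
      np.linalg.norm(A[0], np.inf))   # L-infinity norm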


95
96
Calculus Cheat Sheet

Limits

Definitions
Precise Definition : We say lim_{x->a} f(x) = L if for every eps > 0 there is a delta > 0 such that whenever 0 < |x - a| < delta then |f(x) - L| < eps.
"Working" Definition : We say lim_{x->a} f(x) = L if we can make f(x) as close to L as we want by taking x sufficiently close to a (on either side of a) without letting x = a.
Right hand limit : lim_{x->a+} f(x) = L. This has the same definition as the limit except it requires x > a.
Left hand limit : lim_{x->a-} f(x) = L. This has the same definition as the limit except it requires x < a.
Limit at Infinity : We say lim_{x->inf} f(x) = L if we can make f(x) as close to L as we want by taking x large enough and positive. There is a similar definition for lim_{x->-inf} f(x) = L except we require x large and negative.
Infinite Limit : We say lim_{x->a} f(x) = inf if we can make f(x) arbitrarily large (and positive) by taking x sufficiently close to a (on either side of a) without letting x = a. There is a similar definition for lim_{x->a} f(x) = -inf except we make f(x) arbitrarily large and negative.

Relationship between the limit and one-sided limits
lim_{x->a} f(x) = L  =>  lim_{x->a+} f(x) = lim_{x->a-} f(x) = L
lim_{x->a+} f(x) = lim_{x->a-} f(x) = L  =>  lim_{x->a} f(x) = L
lim_{x->a+} f(x) != lim_{x->a-} f(x)  =>  lim_{x->a} f(x) Does Not Exist

Properties
Assume lim_{x->a} f(x) and lim_{x->a} g(x) both exist and c is any number. Then,
1. lim_{x->a} [c f(x)] = c lim_{x->a} f(x)
2. lim_{x->a} [f(x) +/- g(x)] = lim_{x->a} f(x) +/- lim_{x->a} g(x)
3. lim_{x->a} [f(x) g(x)] = (lim_{x->a} f(x)) (lim_{x->a} g(x))
4. lim_{x->a} [f(x)/g(x)] = (lim_{x->a} f(x)) / (lim_{x->a} g(x)), provided lim_{x->a} g(x) != 0
5. lim_{x->a} [f(x)]^n = [lim_{x->a} f(x)]^n
6. lim_{x->a} [f(x)]^{1/n} = [lim_{x->a} f(x)]^{1/n}  (nth roots)

Basic Limit Evaluations at +/- infinity
Note: sgn(a) = 1 if a > 0 and sgn(a) = -1 if a < 0.
1. lim_{x->inf} e^x = inf  and  lim_{x->-inf} e^x = 0
2. lim_{x->inf} ln(x) = inf  and  lim_{x->0+} ln(x) = -inf
3. If r > 0 then lim_{x->inf} b/x^r = 0
4. If r > 0 and x^r is real for negative x, then lim_{x->-inf} b/x^r = 0
5. n even : lim_{x->+/-inf} x^n = inf
6. n odd : lim_{x->inf} x^n = inf  and  lim_{x->-inf} x^n = -inf
7. n even : lim_{x->+/-inf} (a x^n + ... + b x + c) = sgn(a) inf
8. n odd : lim_{x->inf} (a x^n + ... + b x + c) = sgn(a) inf
9. n odd : lim_{x->-inf} (a x^n + ... + c x + d) = -sgn(a) inf

Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
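For readers following along in Python, sympy can evaluate most of these limits symbolically. A small sketch (the example limits are arbitrary illustrations of the rules above):

import sympy as sp

x = sp.symbols('x')

print(sp.limit(sp.sin(x) / x, x, 0))         # 1
print(sp.limit(sp.exp(x), x, -sp.oo))        # 0, matching lim e^x = 0 as x -> -infinity
print(sp.limit(1 / x**2, x, sp.oo))          # 0, matching b/x^r -> 0 for r > 0
print(sp.limit(x**2 + 5, x, -2, dir='-'))    # 9, a one-sided (left-hand) limit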
97
Evaluation Techniques

Continuous Functions
If f(x) is continuous at a then lim_{x->a} f(x) = f(a).

Continuous Functions and Composition
If f(x) is continuous at b and lim_{x->a} g(x) = b then
lim_{x->a} f(g(x)) = f(lim_{x->a} g(x)) = f(b)

Factor and Cancel
lim_{x->2} (x^2 + 4x - 12)/(x^2 - 2x) = lim_{x->2} ((x - 2)(x + 6))/(x(x - 2)) = lim_{x->2} (x + 6)/x = 8/2 = 4

Rationalize Numerator/Denominator
lim_{x->9} (3 - sqrt(x))/(x^2 - 81) = lim_{x->9} (3 - sqrt(x))/(x^2 - 81) * (3 + sqrt(x))/(3 + sqrt(x))
= lim_{x->9} (9 - x)/((x^2 - 81)(3 + sqrt(x))) = lim_{x->9} -1/((x + 9)(3 + sqrt(x)))
= -1/((18)(6)) = -1/108

Combine Rational Expressions
lim_{h->0} (1/h)(1/(x + h) - 1/x) = lim_{h->0} (1/h)((x - (x + h))/(x(x + h)))
= lim_{h->0} (1/h)(-h/(x(x + h))) = lim_{h->0} -1/(x(x + h)) = -1/x^2

L'Hospital's Rule
If lim_{x->a} f(x)/g(x) = 0/0 or lim_{x->a} f(x)/g(x) = +/-inf / +/-inf then
lim_{x->a} f(x)/g(x) = lim_{x->a} f'(x)/g'(x), where a is a number, inf or -inf.

Polynomials at Infinity
p(x) and q(x) are polynomials. To compute lim_{x->+/-inf} p(x)/q(x), factor the largest power of x in q(x) out of both p(x) and q(x), then compute the limit.
lim_{x->-inf} (3x^2 - 4)/(5x - 2x^2) = lim_{x->-inf} (x^2 (3 - 4/x^2))/(x^2 (5/x - 2)) = lim_{x->-inf} (3 - 4/x^2)/(5/x - 2) = -3/2

Piecewise Function
lim_{x->-2} g(x) where g(x) = { x^2 + 5 if x < -2 ; 1 - 3x if x >= -2 }
Compute two one sided limits,
lim_{x->-2-} g(x) = lim_{x->-2-} (x^2 + 5) = 9
lim_{x->-2+} g(x) = lim_{x->-2+} (1 - 3x) = 7
One sided limits are different so lim_{x->-2} g(x) doesn't exist. If the two one sided limits had been equal then lim_{x->-2} g(x) would have existed and had the same value.

Some Continuous Functions
Partial list of continuous functions and the values of x for which they are continuous.
1. Polynomials for all x.
2. Rational functions, except for x's that give division by zero.
3. nth root of x (n odd) for all x.
4. nth root of x (n even) for all x >= 0.
5. e^x for all x.
6. ln x for x > 0.
7. cos(x) and sin(x) for all x.
8. tan(x) and sec(x) provided x != ..., -3*pi/2, -pi/2, pi/2, 3*pi/2, ...
9. cot(x) and csc(x) provided x != ..., -2*pi, -pi, 0, pi, 2*pi, ...

Intermediate Value Theorem
Suppose that f(x) is continuous on [a, b] and let M be any number between f(a) and f(b). Then there exists a number c such that a < c < b and f(c) = M.
98
Derivatives

Definition and Notation
If y = f(x) then the derivative is defined to be f'(x) = lim_{h->0} (f(x + h) - f(x))/h.
If y = f(x) then all of the following are equivalent notations for the derivative:
f'(x) = y' = df/dx = dy/dx = d/dx (f(x)) = Df(x)
If y = f(x) all of the following are equivalent notations for the derivative evaluated at x = a:
f'(a) = y'|_{x=a} = df/dx|_{x=a} = dy/dx|_{x=a} = Df(a)

Interpretation of the Derivative
If y = f(x) then,
1. m = f'(a) is the slope of the tangent line to y = f(x) at x = a and the equation of the tangent line at x = a is given by y = f(a) + f'(a)(x - a).
2. f'(a) is the instantaneous rate of change of f(x) at x = a.
3. If f(x) is the position of an object at time x then f'(a) is the velocity of the object at x = a.

Basic Properties and Formulas
If f(x) and g(x) are differentiable functions (the derivative exists), c and n are any real numbers,
1. (c f)' = c f'(x)
2. (f +/- g)' = f'(x) +/- g'(x)
3. (f g)' = f' g + f g'  -- Product Rule
4. (f/g)' = (f' g - f g')/g^2  -- Quotient Rule
5. d/dx (c) = 0
6. d/dx (x^n) = n x^{n-1}  -- Power Rule
7. d/dx (f(g(x))) = f'(g(x)) g'(x)  -- This is the Chain Rule

Common Derivatives
d/dx (x) = 1
d/dx (sin x) = cos x
d/dx (cos x) = -sin x
d/dx (tan x) = sec^2 x
d/dx (sec x) = sec x tan x
d/dx (csc x) = -csc x cot x
d/dx (cot x) = -csc^2 x
d/dx (sin^{-1} x) = 1/sqrt(1 - x^2)
d/dx (cos^{-1} x) = -1/sqrt(1 - x^2)
d/dx (tan^{-1} x) = 1/(1 + x^2)
d/dx (a^x) = a^x ln(a)
d/dx (e^x) = e^x
d/dx (ln(x)) = 1/x, x > 0
d/dx (ln|x|) = 1/x, x != 0
d/dx (log_a(x)) = 1/(x ln a), x > 0
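These rules can be cross-checked symbolically in Python with sympy. A minimal sketch (the example functions are arbitrary):

import sympy as sp

x = sp.symbols('x')
f = sp.sin(x**2) * sp.exp(3 * x)

print(sp.diff(x**5, x))             # power rule: 5*x**4
print(sp.diff(sp.tan(x), x))        # sec^2(x), which sympy prints as tan(x)**2 + 1
print(sp.simplify(sp.diff(f, x)))   # product rule + chain rule applied automatically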
99
Chain Rule Variants
The chain rule applied to some specific functions.
1. d/dx ([f(x)]^n) = n [f(x)]^{n-1} f'(x)
2. d/dx (e^{f(x)}) = f'(x) e^{f(x)}
3. d/dx (ln[f(x)]) = f'(x)/f(x)
4. d/dx (sin[f(x)]) = f'(x) cos[f(x)]
5. d/dx (cos[f(x)]) = -f'(x) sin[f(x)]
6. d/dx (tan[f(x)]) = f'(x) sec^2[f(x)]
7. d/dx (sec[f(x)]) = f'(x) sec[f(x)] tan[f(x)]
8. d/dx (tan^{-1}[f(x)]) = f'(x)/(1 + [f(x)]^2)

Higher Order Derivatives
The Second Derivative is denoted as f''(x) = f^{(2)}(x) = d^2 f/dx^2 and is defined as f''(x) = (f'(x))', i.e. the derivative of the first derivative, f'(x).
The nth Derivative is denoted as f^{(n)}(x) = d^n f/dx^n and is defined as f^{(n)}(x) = (f^{(n-1)}(x))', i.e. the derivative of the (n-1)st derivative, f^{(n-1)}(x).

Implicit Differentiation
Find y' if e^{2x - 9y} + x^3 y^2 = sin(y) + 11x. Remember y = y(x) here, so products/quotients of x and y will use the product/quotient rule and derivatives of y will use the chain rule. The "trick" is to differentiate as normal and every time you differentiate a y you tack on a y' (from the chain rule). After differentiating solve for y'.
e^{2x - 9y}(2 - 9y') + 3x^2 y^2 + 2x^3 y y' = cos(y) y' + 11
2e^{2x - 9y} - 9y' e^{2x - 9y} + 3x^2 y^2 + 2x^3 y y' = cos(y) y' + 11
(2x^3 y - 9e^{2x - 9y} - cos(y)) y' = 11 - 2e^{2x - 9y} - 3x^2 y^2
=>  y' = (11 - 2e^{2x - 9y} - 3x^2 y^2)/(2x^3 y - 9e^{2x - 9y} - cos(y))

Increasing/Decreasing - Concave Up/Concave Down

Critical Points
x = c is a critical point of f(x) provided either 1. f'(c) = 0 or 2. f'(c) doesn't exist.

Increasing/Decreasing
1. If f'(x) > 0 for all x in an interval I then f(x) is increasing on the interval I.
2. If f'(x) < 0 for all x in an interval I then f(x) is decreasing on the interval I.
3. If f'(x) = 0 for all x in an interval I then f(x) is constant on the interval I.

Concave Up/Concave Down
1. If f''(x) > 0 for all x in an interval I then f(x) is concave up on the interval I.
2. If f''(x) < 0 for all x in an interval I then f(x) is concave down on the interval I.

Inflection Points
x = c is an inflection point of f(x) if the concavity changes at x = c.
100
Extrema

Absolute Extrema
1. x = c is an absolute maximum of f(x) if f(c) >= f(x) for all x in the domain.
2. x = c is an absolute minimum of f(x) if f(c) <= f(x) for all x in the domain.

Fermat's Theorem
If f(x) has a relative (or local) extrema at x = c, then x = c is a critical point of f(x).

Extreme Value Theorem
If f(x) is continuous on the closed interval [a, b] then there exist numbers c and d so that, 1. a <= c, d <= b, 2. f(c) is the abs. max. in [a, b], 3. f(d) is the abs. min. in [a, b].

Finding Absolute Extrema
To find the absolute extrema of the continuous function f(x) on the interval [a, b] use the following process.
1. Find all critical points of f(x) in [a, b].
2. Evaluate f(x) at all points found in Step 1.
3. Evaluate f(a) and f(b).
4. Identify the abs. max. (largest function value) and the abs. min. (smallest function value) from the evaluations in Steps 2 & 3.

Relative (local) Extrema
1. x = c is a relative (or local) maximum of f(x) if f(c) >= f(x) for all x near c.
2. x = c is a relative (or local) minimum of f(x) if f(c) <= f(x) for all x near c.

1st Derivative Test
If x = c is a critical point of f(x) then x = c is
1. a rel. max. of f(x) if f'(x) > 0 to the left of x = c and f'(x) < 0 to the right of x = c.
2. a rel. min. of f(x) if f'(x) < 0 to the left of x = c and f'(x) > 0 to the right of x = c.
3. not a relative extrema of f(x) if f'(x) is the same sign on both sides of x = c.

2nd Derivative Test
If x = c is a critical point of f(x) such that f'(c) = 0 then x = c
1. is a relative maximum of f(x) if f''(c) < 0.
2. is a relative minimum of f(x) if f''(c) > 0.
3. may be a relative maximum, relative minimum, or neither if f''(c) = 0.

Finding Relative Extrema and/or Classify Critical Points
1. Find all critical points of f(x).
2. Use the 1st derivative test or the 2nd derivative test on each critical point.

Mean Value Theorem
If f(x) is continuous on the closed interval [a, b] and differentiable on the open interval (a, b) then there is a number a < c < b such that f'(c) = (f(b) - f(a))/(b - a).

Newton's Method
If x_n is the nth guess for the root/solution of f(x) = 0 then the (n+1)st guess is
x_{n+1} = x_n - f(x_n)/f'(x_n)
provided f'(x_n) exists.
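A minimal Python sketch of Newton's Method as stated above. The example function, starting guess, tolerance, and iteration cap are arbitrary choices for illustration.

def newton(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) until |f(x)| < tol."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        dfx = f_prime(x)
        if dfx == 0:   # f'(x_n) must exist and be nonzero for the update to make sense
            raise ZeroDivisionError("zero derivative; Newton's Method fails here")
        x = x - fx / dfx
    return x

# Example: solve x^2 - 2 = 0 starting from x0 = 1 (the root is sqrt(2))
root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)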
101
Related Rates
Sketch picture and identify known/unknown quantities. Write down the equation relating the quantities and differentiate with respect to t using implicit differentiation (i.e. add on a derivative every time you differentiate a function of t). Plug in known quantities and solve for the unknown quantity.

Ex. A 15 foot ladder is resting against a wall. The bottom is initially 10 ft away and is being pushed towards the wall at 1/4 ft/sec. How fast is the top moving after 12 sec?
x' is negative because x is decreasing. Using the Pythagorean Theorem and differentiating,
x^2 + y^2 = 15^2  =>  2x x' + 2y y' = 0
After 12 sec we have x = 10 - 12(1/4) = 7 and so y = sqrt(15^2 - 7^2) = sqrt(176). Plug in and solve for y':
7(-1/4) + sqrt(176) y' = 0  =>  y' = 7/(4 sqrt(176)) ft/sec

Ex. Two people are 50 ft apart when one starts walking north. The angle theta changes at 0.01 rad/min. At what rate is the distance between them changing when theta = 0.5 rad?
We have theta' = 0.01 rad/min and want to find x'. We can use various trig functions but the easiest is,
sec(theta) = x/50  =>  sec(theta) tan(theta) theta' = x'/50
We know theta = 0.5 so plug in theta' and solve.
sec(0.5) tan(0.5)(0.01) = x'/50  =>  x' = 0.3112 ft/min
Remember to have the calculator in radians!

Optimization
Sketch picture if needed, write down the equation to be optimized and the constraint. Solve the constraint for one of the two variables and plug into the first equation. Find critical points of the equation in the range of the variables and verify that they are min/max as needed.

Ex. We're enclosing a rectangular field with 500 ft of fence material and one side of the field is a building. Determine the dimensions that will maximize the enclosed area.
Maximize A = xy subject to the constraint x + 2y = 500. Solve the constraint for x and plug into the area.
x = 500 - 2y  =>  A = y(500 - 2y) = 500y - 2y^2
Differentiate and find critical point(s).
A' = 500 - 4y  =>  y = 125
By the 2nd derivative test this is a rel. max. and so is the answer we're after. Finally, find x.
x = 500 - 2(125) = 250
The dimensions are then 250 x 125.

Ex. Determine point(s) on y = x^2 + 1 that are closest to (0, 2).
Minimize f = d^2 = (x - 0)^2 + (y - 2)^2 and the constraint is y = x^2 + 1. Solve the constraint for x^2 and plug into the function.
x^2 = y - 1  =>  f = x^2 + (y - 2)^2 = y - 1 + (y - 2)^2 = y^2 - 3y + 3
Differentiate and find critical point(s).
f' = 2y - 3  =>  y = 3/2
By the 2nd derivative test this is a rel. min. and so all we need to do is find the x value(s).
x^2 = 3/2 - 1 = 1/2  =>  x = +/- sqrt(1/2)
The 2 points are then (sqrt(1/2), 3/2) and (-sqrt(1/2), 3/2).
102
Integrals

Definitions
Definite Integral : Suppose f(x) is continuous on [a, b]. Divide [a, b] into n subintervals of width Delta x and choose x_i* from each interval. Then int_a^b f(x) dx = lim_{n->inf} sum_{i=1}^{n} f(x_i*) Delta x.
Anti-Derivative : An anti-derivative of f(x) is a function, F(x), such that F'(x) = f(x).
Indefinite Integral : int f(x) dx = F(x) + c, where F(x) is an anti-derivative of f(x).

Fundamental Theorem of Calculus
Part I : If f(x) is continuous on [a, b] then g(x) = int_a^x f(t) dt is also continuous on [a, b] and g'(x) = d/dx int_a^x f(t) dt = f(x).
Part II : f(x) is continuous on [a, b], F(x) is an anti-derivative of f(x) (i.e. F(x) = int f(x) dx), then int_a^b f(x) dx = F(b) - F(a).
Variants of Part I :
d/dx int_a^{u(x)} f(t) dt = u'(x) f[u(x)]
d/dx int_{v(x)}^b f(t) dt = -v'(x) f[v(x)]
d/dx int_{v(x)}^{u(x)} f(t) dt = u'(x) f[u(x)] - v'(x) f[v(x)]

Properties
int (f(x) +/- g(x)) dx = int f(x) dx +/- int g(x) dx
int_a^b (f(x) +/- g(x)) dx = int_a^b f(x) dx +/- int_a^b g(x) dx
int c f(x) dx = c int f(x) dx, c is a constant
int_a^b c f(x) dx = c int_a^b f(x) dx, c is a constant
int_a^a f(x) dx = 0
int_a^b c dx = c(b - a)
int_a^b f(x) dx = -int_b^a f(x) dx
|int_a^b f(x) dx| <= int_a^b |f(x)| dx
int_a^b f(x) dx = int_a^c f(x) dx + int_c^b f(x) dx for any value of c.
If f(x) >= g(x) on a <= x <= b then int_a^b f(x) dx >= int_a^b g(x) dx
If f(x) >= 0 on a <= x <= b then int_a^b f(x) dx >= 0
If m <= f(x) <= M on a <= x <= b then m(b - a) <= int_a^b f(x) dx <= M(b - a)

Common Integrals
int k dx = kx + c
int x^n dx = x^{n+1}/(n + 1) + c, n != -1
int x^{-1} dx = int (1/x) dx = ln|x| + c
int 1/(ax + b) dx = (1/a) ln|ax + b| + c
int ln u du = u ln(u) - u + c
int e^u du = e^u + c
int cos u du = sin u + c
int sin u du = -cos u + c
int sec^2 u du = tan u + c
int sec u tan u du = sec u + c
int csc u cot u du = -csc u + c
int csc^2 u du = -cot u + c
int tan u du = ln|sec u| + c
int sec u du = ln|sec u + tan u| + c
int 1/(a^2 + u^2) du = (1/a) tan^{-1}(u/a) + c
int 1/sqrt(a^2 - u^2) du = sin^{-1}(u/a) + c
103
Standard Integration Techniques
Note that at many schools all but the Substitution Rule tend to be taught in a Calculus II class.

u Substitution : The substitution u = g(x) will convert int_a^b f(g(x)) g'(x) dx = int_{g(a)}^{g(b)} f(u) du using du = g'(x) dx. For indefinite integrals drop the limits of integration.
Ex. int_1^2 5x^2 cos(x^3) dx
u = x^3  =>  du = 3x^2 dx  =>  x^2 dx = (1/3) du
x = 1  =>  u = 1^3 = 1 ;  x = 2  =>  u = 2^3 = 8
int_1^2 5x^2 cos(x^3) dx = int_1^8 (5/3) cos(u) du = (5/3) sin(u) |_1^8 = (5/3)(sin(8) - sin(1))

Integration by Parts : int u dv = uv - int v du and int_a^b u dv = uv|_a^b - int_a^b v du. Choose u and dv from the integral and compute du by differentiating u and compute v using v = int dv.
Ex. int x e^{-x} dx
u = x, dv = e^{-x} dx  =>  du = dx, v = -e^{-x}
int x e^{-x} dx = -x e^{-x} + int e^{-x} dx = -x e^{-x} - e^{-x} + c
Ex. int_3^5 ln x dx
u = ln x, dv = dx  =>  du = (1/x) dx, v = x
int_3^5 ln x dx = x ln x |_3^5 - int_3^5 dx = (x ln(x) - x) |_3^5 = 5 ln(5) - 3 ln(3) - 2

Products and (some) Quotients of Trig Functions
For int sin^n x cos^m x dx we have the following :
1. n odd. Strip 1 sine out and convert the rest to cosines using sin^2 x = 1 - cos^2 x, then use the substitution u = cos x.
2. m odd. Strip 1 cosine out and convert the rest to sines using cos^2 x = 1 - sin^2 x, then use the substitution u = sin x.
3. n and m both odd. Use either 1. or 2.
4. n and m both even. Use double angle and/or half angle formulas to reduce the integral into a form that can be integrated.
For int tan^n x sec^m x dx we have the following :
1. n odd. Strip 1 tangent and 1 secant out and convert the rest to secants using tan^2 x = sec^2 x - 1, then use the substitution u = sec x.
2. m even. Strip 2 secants out and convert the rest to tangents using sec^2 x = 1 + tan^2 x, then use the substitution u = tan x.
3. n odd and m even. Use either 1. or 2.
4. n even and m odd. Each integral will be dealt with differently.
Trig Formulas : sin(2x) = 2 sin(x) cos(x),  cos^2(x) = (1/2)(1 + cos(2x)),  sin^2(x) = (1/2)(1 - cos(2x))

Ex. int tan^3 x sec^5 x dx
int tan^3 x sec^5 x dx = int tan^2 x sec^4 x (tan x sec x) dx = int (sec^2 x - 1) sec^4 x (tan x sec x) dx
= int (u^2 - 1) u^4 du   (u = sec x)
= (1/7) sec^7 x - (1/5) sec^5 x + c

Ex. int (sin^5 x / cos^3 x) dx
int (sin^5 x / cos^3 x) dx = int (sin^4 x sin x / cos^3 x) dx = int ((1 - cos^2 x)^2 sin x / cos^3 x) dx
= -int ((1 - u^2)^2 / u^3) du   (u = cos x)
= -int (1 - 2u^2 + u^4)/u^3 du
= (1/2) sec^2 x + 2 ln|cos x| - (1/2) cos^2 x + c
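The worked integrals above can be checked in Python with sympy. A short sketch reproducing the u-substitution and integration-by-parts examples:

import sympy as sp

x = sp.symbols('x')

# u-substitution example: definite integral of 5x^2 cos(x^3) from 1 to 2
print(sp.integrate(5 * x**2 * sp.cos(x**3), (x, 1, 2)))   # 5*sin(8)/3 - 5*sin(1)/3

# integration-by-parts examples
print(sp.integrate(x * sp.exp(-x), x))                    # -(x + 1)*e^{-x} (antiderivative, + c)
print(sp.integrate(sp.log(x), (x, 3, 5)))                 # 5*log(5) - 3*log(3) - 2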
104
Trig Substitutions : If the integral contains the following root use the given substitution and formula to convert into an integral involving trig functions.
sqrt(a^2 - b^2 x^2)  =>  x = (a/b) sin(theta),  cos^2(theta) = 1 - sin^2(theta)
sqrt(b^2 x^2 - a^2)  =>  x = (a/b) sec(theta),  tan^2(theta) = sec^2(theta) - 1
sqrt(a^2 + b^2 x^2)  =>  x = (a/b) tan(theta),  sec^2(theta) = 1 + tan^2(theta)

Ex. int 16/(x^2 sqrt(4 - 9x^2)) dx
x = (2/3) sin(theta)  =>  dx = (2/3) cos(theta) d(theta)
sqrt(4 - 9x^2) = sqrt(4 - 4 sin^2(theta)) = sqrt(4 cos^2(theta)) = 2|cos(theta)|
Recall sqrt(x^2) = |x|. Because we have an indefinite integral we'll assume positive and drop the absolute value bars. If we had a definite integral we'd need to compute theta's and remove the absolute value bars based on that and,
|x| = { x if x >= 0 ; -x if x < 0 }
In this case we have sqrt(4 - 9x^2) = 2 cos(theta).
int 16/((4/9) sin^2(theta) (2 cos(theta))) (2/3) cos(theta) d(theta) = int 12/sin^2(theta) d(theta) = int 12 csc^2(theta) d(theta) = -12 cot(theta) + c
Use Right Triangle Trig to go back to x's. From the substitution we have sin(theta) = 3x/2, so (from the right triangle with opposite side 3x, hypotenuse 2, and adjacent side sqrt(4 - 9x^2)) we see that cot(theta) = sqrt(4 - 9x^2)/(3x). So,
int 16/(x^2 sqrt(4 - 9x^2)) dx = -4 sqrt(4 - 9x^2)/x + c

Partial Fractions : If integrating int P(x)/Q(x) dx where the degree of P(x) is smaller than the degree of Q(x), factor the denominator as completely as possible and find the partial fraction decomposition of the rational expression. Integrate the partial fraction decomposition (P.F.D.). For each factor in the denominator we get term(s) in the decomposition according to the following table.

Factor in Q(x)          Term in P.F.D
ax + b                  A/(ax + b)
(ax + b)^k              A1/(ax + b) + A2/(ax + b)^2 + ... + Ak/(ax + b)^k
ax^2 + bx + c           (Ax + B)/(ax^2 + bx + c)
(ax^2 + bx + c)^k       (A1 x + B1)/(ax^2 + bx + c) + ... + (Ak x + Bk)/(ax^2 + bx + c)^k

Ex. int (7x^2 + 13x)/((x - 1)(x^2 + 4)) dx
(7x^2 + 13x)/((x - 1)(x^2 + 4)) = A/(x - 1) + (Bx + C)/(x^2 + 4) = (A(x^2 + 4) + (Bx + C)(x - 1))/((x - 1)(x^2 + 4))
Set numerators equal and collect like terms.
7x^2 + 13x = (A + B)x^2 + (C - B)x + 4A - C
Set coefficients equal to get a system and solve to get the constants.
A + B = 7,  C - B = 13,  4A - C = 0   =>   A = 4, B = 3, C = 16
Here is the partial fraction form and recombined.
int (7x^2 + 13x)/((x - 1)(x^2 + 4)) dx = int 4/(x - 1) + (3x + 16)/(x^2 + 4) dx
= int 4/(x - 1) + 3x/(x^2 + 4) + 16/(x^2 + 4) dx
= 4 ln|x - 1| + (3/2) ln(x^2 + 4) + 8 tan^{-1}(x/2) + c

An alternate method that sometimes works to find constants. Start with setting numerators equal in the previous example: 7x^2 + 13x = A(x^2 + 4) + (Bx + C)(x - 1). Choose nice values of x and plug in. For example if x = 1 we get 20 = 5A which gives A = 4. This won't always work easily.
105
Calculus Cheat Sheet

Applications of Integrals
b
Net Area : Ú a f ( x ) dx represents the net area between f ( x ) and the
x-axis with area above x-axis positive and area below x-axis negative.

Area Between Curves : The general formulas for the two main cases for each are,
b d
y = f ( x) fi A = Ú Èupper function ˘
Î ˚ - ÈÎlower function ˘˚ dx & x = f ( y) fi A = Ú È right function ˘
Î ˚ - ÈÎleft function ˘˚ dy
a c
If the curves intersect then the area of each portion must be found individually. Here are some
sketches of a couple possible situations and formulas for a couple of possible cases.

d
A = Ú f ( y ) - g ( y ) dy
b c b
A = Ú f ( x ) - g ( x ) dx c A = Ú f ( x ) - g ( x ) dx + Ú g ( x ) - f ( x ) dx
a a c

Volumes of Revolution: The two main formulas are $V = \int A(x)\,dx$ and $V = \int A(y)\,dy$. Here is some general information about each method of computing and some examples.

Rings: $A = \pi\left((\text{outer radius})^2 - (\text{inner radius})^2\right)$
  Limits: x/y of right/bottom ring to x/y of left/top ring.
  Horz. axis use $f(x)$, $g(x)$, $A(x)$ and dx; Vert. axis use $f(y)$, $g(y)$, $A(y)$ and dy.
  Ex. Axis $y = a > 0$: outer radius $a - f(x)$, inner radius $a - g(x)$.
  Ex. Axis $y = a \le 0$: outer radius $a + g(x)$, inner radius $a + f(x)$.

Cylinders: $A = 2\pi(\text{radius})(\text{width/height})$
  Limits: x/y of inner cylinder to x/y of outer cylinder.
  Horz. axis use $f(y)$, $g(y)$, $A(y)$ and dy; Vert. axis use $f(x)$, $g(x)$, $A(x)$ and dx.
  Ex. Axis $y = a > 0$: radius $a - y$, width $f(y) - g(y)$.
  Ex. Axis $y = a \le 0$: radius $a + y$, width $f(y) - g(y)$.

These are only a few cases for horizontal axis of rotation. If the axis of rotation is the x-axis use the $y = a \le 0$ case with $a = 0$. For vertical axis of rotation ($x = a > 0$ and $x = a \le 0$) interchange x and y to get appropriate formulas.

Work: If a force of $F(x)$ moves an object in $a \le x \le b$, the work done is $W = \int_a^b F(x)\,dx$.

Average Function Value: The average value of $f(x)$ on $a \le x \le b$ is $f_{avg} = \frac{1}{b-a}\int_a^b f(x)\,dx$.

Arc Length / Surface Area: Note that this is often a Calc II topic. The three basic formulas are,
$L = \int_a^b ds$    $SA = \int_a^b 2\pi y\,ds$ (rotate about x-axis)    $SA = \int_a^b 2\pi x\,ds$ (rotate about y-axis)
where ds is dependent upon the form of the function being worked with as follows.
$ds = \sqrt{1 + \left(\frac{dy}{dx}\right)^2}\,dx$ if $y = f(x)$, $a \le x \le b$
$ds = \sqrt{1 + \left(\frac{dx}{dy}\right)^2}\,dy$ if $x = f(y)$, $a \le y \le b$
$ds = \sqrt{\left(\frac{dx}{dt}\right)^2 + \left(\frac{dy}{dt}\right)^2}\,dt$ if $x = f(t)$, $y = g(t)$, $a \le t \le b$
$ds = \sqrt{r^2 + \left(\frac{dr}{d\theta}\right)^2}\,d\theta$ if $r = f(\theta)$, $a \le \theta \le b$

With surface area you may have to substitute in for the x or y depending on your choice of ds to match the differential in the ds. With parametric and polar you will always need to substitute.

Improper Integral
An improper integral is an integral with one or more infinite limits and/or discontinuous integrands. The integral is called convergent if the limit exists and has a finite value and divergent if the limit doesn't exist or has infinite value. This is typically a Calc II topic.

Infinite Limit
1. $\int_a^{\infty} f(x)\,dx = \lim_{t\to\infty}\int_a^t f(x)\,dx$
2. $\int_{-\infty}^{b} f(x)\,dx = \lim_{t\to-\infty}\int_t^b f(x)\,dx$
3. $\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{c} f(x)\,dx + \int_c^{\infty} f(x)\,dx$ provided BOTH integrals are convergent.

Discontinuous Integrand
1. Discont. at a: $\int_a^b f(x)\,dx = \lim_{t\to a^+}\int_t^b f(x)\,dx$
2. Discont. at b: $\int_a^b f(x)\,dx = \lim_{t\to b^-}\int_a^t f(x)\,dx$
3. Discontinuity at $a < c < b$: $\int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx$ provided both are convergent.

Comparison Test for Improper Integrals: If $f(x) \ge g(x) \ge 0$ on $[a,\infty)$ then,
1. If $\int_a^{\infty} f(x)\,dx$ conv. then $\int_a^{\infty} g(x)\,dx$ conv.
2. If $\int_a^{\infty} g(x)\,dx$ divg. then $\int_a^{\infty} f(x)\,dx$ divg.

Useful fact: If $a > 0$ then $\int_a^{\infty} \frac{1}{x^p}\,dx$ converges if $p > 1$ and diverges for $p \le 1$.

Approximating Definite Integrals

For a given integral $\int_a^b f(x)\,dx$ and n (must be even for Simpson's Rule) define $\Delta x = \frac{b-a}{n}$ and divide $[a,b]$ into n subintervals $[x_0, x_1]$, $[x_1, x_2]$, ..., $[x_{n-1}, x_n]$ with $x_0 = a$ and $x_n = b$ then,

Midpoint Rule: $\int_a^b f(x)\,dx \approx \Delta x\left[f(x_1^*) + f(x_2^*) + \cdots + f(x_n^*)\right]$, where $x_i^*$ is the midpoint of $[x_{i-1}, x_i]$

Trapezoid Rule: $\int_a^b f(x)\,dx \approx \frac{\Delta x}{2}\left[f(x_0) + 2f(x_1) + 2f(x_2) + \cdots + 2f(x_{n-1}) + f(x_n)\right]$

Simpson's Rule: $\int_a^b f(x)\,dx \approx \frac{\Delta x}{3}\left[f(x_0) + 4f(x_1) + 2f(x_2) + \cdots + 2f(x_{n-2}) + 4f(x_{n-1}) + f(x_n)\right]$
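The three rules above translate directly into code. A minimal NumPy sketch (the function names and the test integrand are illustrative, not from the original sheet):

import numpy as np

def midpoint_rule(f, a, b, n):
    # Evaluate f at the midpoint of each of the n subintervals.
    x = np.linspace(a, b, n + 1)
    mids = (x[:-1] + x[1:]) / 2
    return (b - a) / n * np.sum(f(mids))

def trapezoid_rule(f, a, b, n):
    # Endpoints weighted 1, interior points weighted 2.
    x = np.linspace(a, b, n + 1)
    y = f(x)
    return (b - a) / n / 2 * (y[0] + 2 * np.sum(y[1:-1]) + y[-1])

def simpsons_rule(f, a, b, n):
    # n must be even; weights follow the 1, 4, 2, 4, ..., 4, 1 pattern.
    assert n % 2 == 0, "Simpson's Rule needs an even number of subintervals"
    x = np.linspace(a, b, n + 1)
    y = f(x)
    return (b - a) / n / 3 * (y[0] + 4 * np.sum(y[1:-1:2]) + 2 * np.sum(y[2:-1:2]) + y[-1])

# Example: all three approximate the exact value 1/3 for x^2 on [0, 1].
f = lambda x: x**2
print(midpoint_rule(f, 0, 1, 10), trapezoid_rule(f, 0, 1, 10), simpsons_rule(f, 0, 1, 10))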

Regression (Machine Learning)

Summary:
Common Applications in Business: Used to predict a numeric value (e.g. forecasting sales, estimating prices, etc).
Key Concept: Data is usually in a rectangular format (like a spreadsheet) with one column that is a target (e.g. price) and other columns that are predictors (e.g. product category).
Gotchas:
- Preprocessing: Knowing when to preprocess data (normalize) prior to the machine learning step.
- Feature Engineering: Getting good features is more important than applying complex models.
- Parameter Tuning: Higher complexity models have many parameters that can be tuned.
- Interpretability: Some models are more explainable than others, meaning the estimate for each feature means something in relation to the target. Other models are not interpretable and require additional tools (e.g. LIME) to explain.

Linear Regression with Multiple Predictors: Each predictor (term) is interpretable, meaning the value (estimate) indicates an increase/decrease in the target.

Decision Tree (Regression): Decisions are based on binary rules that split the nodes and terminate at leaves. Regression trees estimate the value at each node.

Terminology:
Supervised vs Unsupervised: Regression is a supervised technique that requires training with a "target" (e.g. price of product or sales by month). The algorithm learns by identifying relationships between the target & the predictors (attributes related to the target like category of product or month of sales).
Classification vs Regression: Classification aims to predict classes (either binary yes/no or multi-class categorical). Regression aims to predict a numeric value (e.g. product price = $4,233).
Preprocessing: Many algorithms require preprocessing, which transforms the data into a format more suitable for the machine learning algorithm. A common example is "standardization", or scaling the feature to be in a range of [0,1] (or close to it).
Hyper Parameter & Tuning: Machine learning algorithms have many parameters that can be adjusted (e.g. learning rate in GBM). Tuning is the process of systematically finding the optimum parameter values.
Cross Validation: Machine learning algorithms should be tuned on a validation set as opposed to a test set. Cross-validation is the process of splitting the training set into multiple sets, using a portion of the training set for tuning.
Performance Metrics (Regression): Common performance metrics are Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). These measures provide an estimate of model performance to compare models to each other.

R Cheat Sheet - Parsnip (Machine Learning):
Model List (start here first): Linear Regression & GLM, Decision Tree, Random Forest, Boosted Trees (XGBoost), SVM: Poly & Radial
Keras (Deep Learning) | H2O (ML & DL Framework) | MLR (ML Framework)

Python Cheat Sheet - Scikit-Learn (Machine Learning):
Linear Regression, GLM (Elastic Net), Decision Tree (Regressor), Random Forest (Regressor), AdaBoost (Regressor), XGBoost, SVM (Regressor)
Keras (Deep Learning) | H2O (ML & DL Framework)

Resources:
Business Analysis With R Course (DS4B 101-R), Modeling - Week 6 - Data Science Courses for Business
Business Science Problem Framework
Ultimate R Cheat Sheet | Ultimate Python Cheat Sheet
Business Science University - university.business-science.io
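A minimal scikit-learn sketch of the workflow described above (standardize, fit a regression model, tune with cross-validation, score with MAE/RMSE); the synthetic dataset and parameter grid are illustrative assumptions, not part of the original sheet:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline: standardize predictors, then fit a Random Forest regressor.
pipe = Pipeline([("scale", StandardScaler()),
                 ("rf", RandomForestRegressor(random_state=42))])

# Tune hyperparameters with 5-fold cross-validation on the training set only.
grid = GridSearchCV(pipe, {"rf__n_estimators": [100, 300], "rf__max_depth": [None, 10]}, cv=5)
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))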
Machine Learning Algorithms - Regression: Key Attributes Table (version 1.1, graphic table)
Business Science University - Data Science Courses for Business - university.business-science.io
CS 229 – Machine Learning https://stanford.edu/~shervine

VIP Cheatsheet: Supervised Learning
Afshine Amidi and Shervine Amidi, September 9, 2018

Common loss functions (see "Loss function" below):
- Least squared: $\frac{1}{2}(y-z)^2$ (linear regression)
- Logistic: $\log(1+\exp(-yz))$ (logistic regression)
- Hinge: $\max(0,\,1-yz)$ (SVM)
- Cross-entropy: $-\left[y\log(z)+(1-y)\log(1-z)\right]$ (neural networks)

Introduction to Supervised Learning
Given a set of data points $\{x^{(1)}, ..., x^{(m)}\}$ associated to a set of outcomes $\{y^{(1)}, ..., y^{(m)}\}$, we want to build a classifier that learns how to predict y from x.

Type of prediction – The different types of predictive models are summed up in the table below:
  Regression: continuous outcome; example: linear regression.
  Classifier: class outcome; examples: logistic regression, SVM, Naive Bayes.

Type of model – The different models are summed up in the table below:
  Discriminative model: directly estimate $P(y|x)$; what's learned: decision boundary; examples: regressions, SVMs.
  Generative model: estimate $P(x|y)$ to deduce $P(y|x)$; what's learned: probability distributions of the data; examples: GDA, Naive Bayes.

Notations and general concepts
Hypothesis – The hypothesis is noted $h_\theta$ and is the model that we choose. For a given input data $x^{(i)}$, the model prediction output is $h_\theta(x^{(i)})$.

Loss function – A loss function is a function $L : (z,y) \in \mathbb{R} \times Y \mapsto L(z,y) \in \mathbb{R}$ that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table above.

Cost function – The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:
$J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})$

Gradient descent – By noting $\alpha \in \mathbb{R}$ the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:
$\theta \leftarrow \theta - \alpha \nabla J(\theta)$
Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.

Likelihood – The likelihood of a model $L(\theta)$ given parameters $\theta$ is used to find the optimal parameters $\theta$ through maximizing the likelihood. In practice, we use the log-likelihood $\ell(\theta) = \log(L(\theta))$ which is easier to optimize. We have:
$\theta_{opt} = \arg\max_\theta L(\theta)$

Newton's algorithm – Newton's algorithm is a numerical method that finds $\theta$ such that $\ell'(\theta) = 0$. Its update rule is as follows:
$\theta \leftarrow \theta - \frac{\ell'(\theta)}{\ell''(\theta)}$
Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:
$\theta \leftarrow \theta - \left(\nabla_\theta^2 \ell(\theta)\right)^{-1} \nabla_\theta \ell(\theta)$
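A minimal NumPy sketch of the batch gradient descent update above, applied to linear regression with the least-squares loss (the learning rate, iteration count and synthetic data are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]     # design matrix with intercept column
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

theta = np.zeros(3)
alpha = 0.1                                            # learning rate
for _ in range(1000):
    grad = X.T @ (X @ theta - y) / len(y)              # gradient of the mean squared-error cost
    theta -= alpha * grad                              # theta <- theta - alpha * grad J(theta)

print(theta)                                 # close to [1, 2, -3]
print(np.linalg.solve(X.T @ X, X.T @ y))     # normal-equation solution for comparison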

Linear regression
We assume here that $y|x;\theta \sim \mathcal{N}(\mu, \sigma^2)$.

Normal equations – By noting X the design matrix, the value of $\theta$ that minimizes the cost function is a closed-form solution such that:
$\theta = (X^T X)^{-1} X^T y$

LMS algorithm – By noting $\alpha$ the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:
$\forall j,\quad \theta_j \leftarrow \theta_j + \alpha \sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$
Remark: the update rule is a particular case of the gradient ascent.

LWR – Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by $w^{(i)}(x)$, which is defined with parameter $\tau \in \mathbb{R}$ as:
$w^{(i)}(x) = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$

Classification and logistic regression
Sigmoid function – The sigmoid function g, also known as the logistic function, is defined as follows:
$\forall z \in \mathbb{R},\quad g(z) = \frac{1}{1 + e^{-z}} \in\ ]0,1[$

Logistic regression – We assume here that $y|x;\theta \sim \text{Bernoulli}(\phi)$. We have the following form:
$\phi = p(y = 1|x;\theta) = \frac{1}{1 + \exp(-\theta^T x)} = g(\theta^T x)$
Remark: there is no closed form solution for the case of logistic regressions.

Softmax regression – A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set $\theta_K = 0$, which makes the Bernoulli parameter $\phi_i$ of each class i equal to:
$\phi_i = \frac{\exp(\theta_i^T x)}{\sum_{j=1}^{K} \exp(\theta_j^T x)}$

Generalized Linear Models
Exponential family – A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, $\eta$, a sufficient statistic T(y) and a log-partition function $a(\eta)$ as follows:
$p(y;\eta) = b(y)\exp(\eta T(y) - a(\eta))$
Remark: we will often have T(y) = y. Also, $\exp(-a(\eta))$ can be seen as a normalization parameter that will make sure that the probabilities sum to one.

Here are the most common exponential distributions summed up in the following table:
Distribution   $\eta$                        T(y)   $a(\eta)$                               b(y)
Bernoulli      $\log\frac{\phi}{1-\phi}$     y      $\log(1+\exp(\eta))$                    1
Gaussian       $\mu$                         y      $\frac{\eta^2}{2}$                      $\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right)$
Poisson        $\log(\lambda)$               y      $e^{\eta}$                              $\frac{1}{y!}$
Geometric      $\log(1-\phi)$                y      $\log\frac{e^{\eta}}{1-e^{\eta}}$       1

Assumptions of GLMs – Generalized Linear Models (GLM) aim at predicting a random variable y as a function of $x \in \mathbb{R}^{n+1}$ and rely on the following 3 assumptions:
(1) $y|x;\theta \sim \text{ExpFamily}(\eta)$   (2) $h_\theta(x) = E[y|x;\theta]$   (3) $\eta = \theta^T x$
Remark: ordinary least squares and logistic regression are special cases of generalized linear models.

Support Vector Machines
The goal of support vector machines is to find the line that maximizes the minimum distance to the line.
Optimal margin classifier – The optimal margin classifier h is such that:
$h(x) = \text{sign}(w^T x - b)$
where $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ is the solution of the following optimization problem:
$\min \frac{1}{2}||w||^2 \quad\text{such that}\quad y^{(i)}(w^T x^{(i)} - b) \ge 1$
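A small scikit-learn sketch contrasting the linear regression and logistic regression settings above (the synthetic data and true weights are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))

# Regression target: continuous outcome y|x ~ N(theta^T x, sigma^2)
y_cont = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
lin = LinearRegression().fit(X, y_cont)
print(lin.coef_)                        # close to the true weights

# Classification target: Bernoulli outcome through the sigmoid g(theta^T x)
p = 1 / (1 + np.exp(-(X @ np.array([2.0, -1.0, 0.0]))))
y_bin = rng.binomial(1, p)
log = LogisticRegression().fit(X, y_bin)
print(log.predict_proba(X[:3]))         # class probabilities for the first rows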

Remark: the line is defined as $w^T x - b = 0$.

Hinge loss – The hinge loss is used in the setting of SVMs and is defined as follows:
$L(z,y) = [1 - yz]_+ = \max(0,\, 1 - yz)$

Kernel – Given a feature mapping $\phi$, we define the kernel K to be defined as:
$K(x,z) = \phi(x)^T \phi(z)$
In practice, the kernel K defined by $K(x,z) = \exp\left(-\frac{||x-z||^2}{2\sigma^2}\right)$ is called the Gaussian kernel and is commonly used.
Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping $\phi$, which is often very complicated. Instead, only the values K(x,z) are needed.

Lagrangian – We define the Lagrangian $\mathcal{L}(w,b)$ as follows:
$\mathcal{L}(w,b) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
Remark: the coefficients $\beta_i$ are called the Lagrange multipliers.

Generative Learning
A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.

Gaussian Discriminant Analysis
Setting – The Gaussian Discriminant Analysis assumes that y and x|y = 0 and x|y = 1 are such that:
$y \sim \text{Bernoulli}(\phi)$
$x|y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$ and $x|y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$

Estimation – The following table sums up the estimates that we find when maximizing the likelihood:
$\hat{\phi} = \frac{1}{m}\sum_{i=1}^{m} 1_{\{y^{(i)}=1\}}$
$\hat{\mu}_j = \frac{\sum_{i=1}^{m} 1_{\{y^{(i)}=j\}}\, x^{(i)}}{\sum_{i=1}^{m} 1_{\{y^{(i)}=j\}}}$  (j = 0, 1)
$\hat{\Sigma} = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$

Naive Bayes
Assumption – The Naive Bayes model supposes that the features of each data point are all independent:
$P(x|y) = P(x_1, x_2, ...|y) = P(x_1|y)P(x_2|y)... = \prod_{i=1}^{n} P(x_i|y)$

Solutions – Maximizing the log-likelihood gives the following solutions, with $k \in \{0,1\}$, $l \in [[1,L]]$:
$P(y = k) = \frac{1}{m} \times \#\{j\,|\,y^{(j)} = k\}$ and $P(x_i = l|y = k) = \frac{\#\{j\,|\,y^{(j)} = k \text{ and } x_i^{(j)} = l\}}{\#\{j\,|\,y^{(j)} = k\}}$
Remark: Naive Bayes is widely used for text classification and spam detection.

Tree-based and ensemble methods
These methods can be used for both regression and classification problems.
CART – Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage to be very interpretable.
Random forest – It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.
Remark: random forests are a type of ensemble methods.
Boosting – The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:

Adaptive boosting: high weights are put on errors to improve at the next boosting step; known as Adaboost.
Gradient boosting: weak learners are trained on remaining errors.

Other non-parametric approaches
k-nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.
Remark: the higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.

Learning Theory
Union bound – Let $A_1, ..., A_k$ be k events. We have:
$P(A_1 \cup ... \cup A_k) \le P(A_1) + ... + P(A_k)$

Hoeffding inequality – Let $Z_1, ..., Z_m$ be m iid variables drawn from a Bernoulli distribution of parameter $\phi$. Let $\hat{\phi}$ be their sample mean and $\gamma > 0$ fixed. We have:
$P(|\phi - \hat{\phi}| > \gamma) \le 2\exp(-2\gamma^2 m)$
Remark: this inequality is also known as the Chernoff bound.

Training error – For a given classifier h, we define the training error $\hat{\epsilon}(h)$, also known as the empirical risk or empirical error, to be as follows:
$\hat{\epsilon}(h) = \frac{1}{m}\sum_{i=1}^{m} 1_{\{h(x^{(i)}) \neq y^{(i)}\}}$

Probably Approximately Correct (PAC) – PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:
- the training and testing sets follow the same distribution
- the training examples are drawn independently

Shattering – Given a set $S = \{x^{(1)}, ..., x^{(d)}\}$, and a set of classifiers $\mathcal{H}$, we say that $\mathcal{H}$ shatters S if for any set of labels $\{y^{(1)}, ..., y^{(d)}\}$, we have:
$\exists h \in \mathcal{H},\ \forall i \in [[1,d]],\quad h(x^{(i)}) = y^{(i)}$

Upper bound theorem – Let $\mathcal{H}$ be a finite hypothesis class such that $|\mathcal{H}| = k$ and let $\delta$ and the sample size m be fixed. Then, with probability of at least $1 - \delta$, we have:
$\epsilon(\hat{h}) \le \left(\min_{h \in \mathcal{H}} \epsilon(h)\right) + 2\sqrt{\frac{1}{2m}\log\left(\frac{2k}{\delta}\right)}$

VC dimension – The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class $\mathcal{H}$, noted $VC(\mathcal{H})$, is the size of the largest set that is shattered by $\mathcal{H}$.
Remark: the VC dimension of $\mathcal{H} = \{\text{set of linear classifiers in 2 dimensions}\}$ is 3.

Theorem (Vapnik) – Let $\mathcal{H}$ be given, with $VC(\mathcal{H}) = d$ and m the number of training examples. With probability at least $1 - \delta$, we have:
$\epsilon(\hat{h}) \le \left(\min_{h \in \mathcal{H}} \epsilon(h)\right) + O\left(\sqrt{\frac{d}{m}\log\left(\frac{m}{d}\right) + \frac{1}{m}\log\left(\frac{1}{\delta}\right)}\right)$
Segmentation & Clustering

Summary:
Common Applications in Business: Can be used for finding segments within Customers, Companies, etc.
Key Concept: Transform data into a matrix enabling trends to be compared across units of measure (e.g. user-item matrix).
Gotchas: Data must be normalized or standardized to enable comparison. This often requires calculating proportions of values by customer, company, etc. to ensure the larger values do not dominate the trend mining operation.
How Many Components/Clusters? Use a Scree Plot to determine the proportion of variance explained or total within sum of squares.

Example: Combining K-Means & UMAP to visualize clusters by Stock Price Movements.

R Cheat Sheet
K-Means
set.seed(0)
kmeans_obj <- kmeans(X, centers = 4)

UMAP
library(umap)
umap_obj <- umap(X)

Python Cheat Sheet
K-Means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)

UMAP
import umap
reducer = umap.UMAP()
embedding = reducer.fit_transform(X)

Resources
Business Analysis With R Course (DS4B 101-R), Modeling - Week 6 - Data Science Courses for Business
Business Science Problem Framework
Ultimate R Cheat Sheet | Ultimate Python Cheat Sheet
Business Science University - university.business-science.io

Segmentation & Clustering: How to apply K-Means & UMAP step-by-step (version 1.0)

Clustering Workflow
Collect Data, then Standardize / Normalize, then Spread to User-Item Format, then:
- K-Means: obtain cluster assignments
- UMAP: make 2D projection
then Combine.

K-Means: Scree Plot
Used to pick a value for K clusters for the K-means algorithm. Iteratively calculate "tot.withinss" for values of K. Look for a bend.

UMAP: 2D Projection
Fast dimensionality reduction algorithm that can be used for visualization.
Better than PCA: PCA is linear, UMAP is nonlinear.
Better than tSNE: tSNE is slow.

Combine
Plot the K-Means cluster assignments with the UMAP 2D projection to obtain a visual. Add interactivity to enable exploration.

Business Science University - Data Science Courses for Business - university.business-science.io
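A minimal Python sketch of the scree-plot step and the K-Means + UMAP combination described above (the data and cluster counts are illustrative assumptions; the umap-learn package is assumed to be installed):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(300, 10)))

# Scree plot: total within-cluster sum of squares (inertia) for several K; look for a bend.
inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o"); plt.xlabel("K"); plt.ylabel("tot. within SS")

# Combine: cluster assignments from K-Means, 2D coordinates from UMAP.
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)
embedding = umap.UMAP(random_state=0).fit_transform(X)
plt.figure(); plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)
plt.show()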
CS 229 – Machine Learning https://stanford.edu/~shervine

VIP Cheatsheet: Unsupervised Learning
Afshine Amidi and Shervine Amidi, September 9, 2018

Introduction to Unsupervised Learning
Motivation – The goal of unsupervised learning is to find hidden patterns in unlabeled data $\{x^{(1)}, ..., x^{(m)}\}$.
Jensen's inequality – Let f be a convex function and X a random variable. We have the following inequality:
$E[f(X)] \ge f(E[X])$

Expectation-Maximization
Latent variables – Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:
  Mixture of k Gaussians: $z \sim \text{Multinomial}(\phi)$, $x|z \sim \mathcal{N}(\mu_j, \Sigma_j)$, with $\mu_j \in \mathbb{R}^n$, $\phi \in \mathbb{R}^k$.
  Factor analysis: $z \sim \mathcal{N}(0, I)$, $x|z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, with $\mu_j \in \mathbb{R}^n$.

Algorithm – The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter $\theta$ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:
- E-step: Evaluate the posterior probability $Q_i(z^{(i)})$ that each data point $x^{(i)}$ came from a particular cluster $z^{(i)}$ as follows:
  $Q_i(z^{(i)}) = P(z^{(i)}|x^{(i)}; \theta)$
- M-step: Use the posterior probabilities $Q_i(z^{(i)})$ as cluster specific weights on data points $x^{(i)}$ to separately re-estimate each cluster model as follows:
  $\theta_i = \arg\max_\theta \sum_i \int_{z^{(i)}} Q_i(z^{(i)}) \log\left(\frac{P(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\right) dz^{(i)}$

k-means clustering
We note $c^{(i)}$ the cluster of data point i and $\mu_j$ the center of cluster j.
Algorithm – After randomly initializing the cluster centroids $\mu_1, \mu_2, ..., \mu_k \in \mathbb{R}^n$, the k-means algorithm repeats the following step until convergence:
$c^{(i)} = \arg\min_j ||x^{(i)} - \mu_j||^2$ and $\mu_j = \frac{\sum_{i=1}^{m} 1_{\{c^{(i)}=j\}}\, x^{(i)}}{\sum_{i=1}^{m} 1_{\{c^{(i)}=j\}}}$

Distortion function – In order to see if the algorithm converges, we look at the distortion function defined as follows:
$J(c, \mu) = \sum_{i=1}^{m} ||x^{(i)} - \mu_{c^{(i)}}||^2$

Hierarchical clustering
Algorithm – It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.
Types – There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which is summed up in the table below:
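A compact NumPy sketch of the two alternating k-means updates above (the initialization scheme and synthetic data are illustrative assumptions):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]          # random initial centroids
    for _ in range(n_iter):
        # Assignment step: c(i) = argmin_j ||x(i) - mu_j||^2
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # Update step: mu_j = mean of points assigned to cluster j
        mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j] for j in range(k)])
    distortion = ((X - mu[c]) ** 2).sum()                  # J(c, mu)
    return c, mu, distortion

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in ([0, 0], [3, 3], [0, 3])])
labels, centers, J = kmeans(X, k=3)
print(centers, J)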

Ward linkage: minimize within cluster distance. Average linkage: minimize average distance between cluster pairs. Complete linkage: minimize maximum distance between cluster pairs.

Clustering assessment metrics
In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.
Silhouette coefficient – By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:
$s = \frac{b - a}{\max(a, b)}$

Calinski-Harabaz index – By noting k the number of clusters, $B_k$ and $W_k$ the between and within-clustering dispersion matrices respectively defined as
$B_k = \sum_{j=1}^{k} n_{c^{(i)}}(\mu_{c^{(i)}} - \mu)(\mu_{c^{(i)}} - \mu)^T$,  $W_k = \sum_{i=1}^{m} (x^{(i)} - \mu_{c^{(i)}})(x^{(i)} - \mu_{c^{(i)}})^T$,
the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:
$s(k) = \frac{\text{Tr}(B_k)}{\text{Tr}(W_k)} \times \frac{N - k}{k - 1}$

Principal component analysis
It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.
Eigenvalue, eigenvector – Given a matrix $A \in \mathbb{R}^{n\times n}$, $\lambda$ is said to be an eigenvalue of A if there exists a vector $z \in \mathbb{R}^n\setminus\{0\}$, called eigenvector, such that we have:
$Az = \lambda z$
Spectral theorem – Let $A \in \mathbb{R}^{n\times n}$. If A is symmetric, then A is diagonalizable by a real orthogonal matrix $U \in \mathbb{R}^{n\times n}$. By noting $\Lambda = \text{diag}(\lambda_1, ..., \lambda_n)$, we have:
$\exists \Lambda$ diagonal, $A = U\Lambda U^T$
Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.

Algorithm – The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows (a NumPy sketch appears at the end of this page):
- Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.
  $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{\sigma_j}$ where $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$ and $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} (x_j^{(i)} - \mu_j)^2$
- Step 2: Compute $\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)T} \in \mathbb{R}^{n\times n}$, which is symmetric with real eigenvalues.
- Step 3: Compute $u_1, ..., u_k \in \mathbb{R}^n$ the k orthogonal principal eigenvectors of $\Sigma$, i.e. the orthogonal eigenvectors of the k largest eigenvalues.
- Step 4: Project the data on $\text{span}_{\mathbb{R}}(u_1, ..., u_k)$. This procedure maximizes the variance among all k-dimensional spaces.

Independent component analysis
It is a technique meant to find the underlying generating sources.
Assumptions – We assume that our data x has been generated by the n-dimensional source vector $s = (s_1, ..., s_n)$, where $s_i$ are independent random variables, via a mixing and non-singular matrix A as follows:
$x = As$
The goal is to find the unmixing matrix $W = A^{-1}$ by an update rule.

Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix W by following the steps below:
- Write the probability of $x = As = W^{-1}s$ as:
  $p(x) = \prod_{i=1}^{n} p_s(w_i^T x) \cdot |W|$
- Write the log likelihood given our training data $\{x^{(i)}, i \in [[1,m]]\}$ and by noting g the sigmoid function as:
  $l(W) = \sum_{i=1}^{m}\left(\sum_{j=1}^{n} \log\left(g'(w_j^T x^{(i)})\right) + \log|W|\right)$
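A minimal NumPy sketch of the four PCA steps above (the synthetic correlated data and the helper name are illustrative assumptions):

import numpy as np

def pca(X, k):
    # Step 1: standardize each feature to mean 0 and standard deviation 1.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: empirical covariance matrix (symmetric, real eigenvalues).
    sigma = X.T @ X / len(X)
    # Step 3: eigenvectors of the k largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(sigma)        # eigh returns ascending eigenvalues for symmetric matrices
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Step 4: project the data onto span(u_1, ..., u_k).
    return X @ U

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features
print(pca(X, 2).shape)    # (200, 2)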

Therefore, the stochastic gradient ascent learning rule is such that for each training example $x^{(i)}$, we update W as follows:
$W \leftarrow W + \alpha\left(\begin{pmatrix} 1 - 2g(w_1^T x^{(i)}) \\ 1 - 2g(w_2^T x^{(i)}) \\ \vdots \\ 1 - 2g(w_n^T x^{(i)}) \end{pmatrix} x^{(i)T} + (W^T)^{-1}\right)$

CS 229 – Machine Learning https://stanford.edu/~shervine

VIP Cheatsheet: Machine Learning Tips
Afshine Amidi and Shervine Amidi, September 9, 2018

Metrics
Given a set of data points $\{x^{(1)}, ..., x^{(m)}\}$, where each $x^{(i)}$ has n features, associated to a set of outcomes $\{y^{(1)}, ..., y^{(m)}\}$, we want to assess a given classifier that learns how to predict y from x.

Classification
In a context of a binary classification, here are the main metrics that are important to track to assess the performance of the model.
Confusion matrix – The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:
  Actual +, Predicted +: TP (True Positives)
  Actual +, Predicted -: FN (False Negatives, Type II error)
  Actual -, Predicted +: FP (False Positives, Type I error)
  Actual -, Predicted -: TN (True Negatives)

Main metrics – The following metrics are commonly used to assess the performance of classification models:
  Accuracy: (TP + TN)/(TP + TN + FP + FN), overall performance of model.
  Precision: TP/(TP + FP), how accurate the positive predictions are.
  Recall (sensitivity): TP/(TP + FN), coverage of actual positive sample.
  Specificity: TN/(TN + FP), coverage of actual negative sample.
  F1 score: 2TP/(2TP + FP + FN), hybrid metric useful for unbalanced classes.

ROC – The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:
  True Positive Rate (TPR): TP/(TP + FN), equivalent to recall, sensitivity.
  False Positive Rate (FPR): FP/(TN + FP), equivalent to 1 - specificity.

AUC – The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC.

Regression
Basic metrics – Given a regression model f, the following metrics are commonly used to assess the performance of the model:
  Total sum of squares: $SS_{tot} = \sum_{i=1}^{m} (y_i - \bar{y})^2$
  Explained sum of squares: $SS_{reg} = \sum_{i=1}^{m} (f(x_i) - \bar{y})^2$
  Residual sum of squares: $SS_{res} = \sum_{i=1}^{m} (y_i - f(x_i))^2$

Coefficient of determination – The coefficient of determination, often noted $R^2$ or $r^2$, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:
$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

Main metrics – The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:
  Mallow's Cp: $\frac{SS_{res} + 2(n+1)\hat{\sigma}^2}{m}$
  AIC: $2\left[(n+2) - \log(L)\right]$
  BIC: $\log(m)(n+2) - 2\log(L)$
  Adjusted $R^2$: $1 - \frac{(1-R^2)(m-1)}{m-n-1}$
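A short scikit-learn sketch computing the classification metrics above on an illustrative train/test split (the synthetic, imbalanced dataset is an assumption):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]

print(confusion_matrix(y_te, pred))                 # [[TN, FP], [FN, TP]]
print(accuracy_score(y_te, pred), precision_score(y_te, pred),
      recall_score(y_te, pred), f1_score(y_te, pred))
print(roc_auc_score(y_te, proba))                   # AUC uses scores, not hard labels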

where L is the likelihood and $\hat{\sigma}^2$ is an estimate of the variance associated with each response.

Model selection
Vocabulary – When selecting a model, we distinguish 3 different parts of the data that we have as follows:
  Training set: model is trained; usually 80% of the dataset.
  Validation set: model is assessed; usually 20% of the dataset; also called hold-out or development set.
  Testing set: model gives predictions; unseen data.
Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set.

Model selection – Train the model on the training set, then evaluate on the development set, then pick the best performing model on the development set, and retrain that model on the whole training set.

Cross-validation – Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:
  k-fold: training on k - 1 folds and assessment on the remaining one; generally k = 5 or 10.
  Leave-p-out: training on n - p observations and assessment on the p remaining ones; the case p = 1 is called leave-one-out.
The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k - 1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error (see the cross-validation sketch at the end of this page).

Regularization – The regularization procedure aims at avoiding the model to overfit the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:
  LASSO: shrinks coefficients to 0, good for variable selection; adds $\lambda||\theta||_1$ to the cost, with $\lambda \in \mathbb{R}$.
  Ridge: makes coefficients smaller; adds $\lambda||\theta||_2^2$ to the cost, with $\lambda \in \mathbb{R}$.
  Elastic Net: tradeoff between variable selection and small coefficients; adds $\lambda\left[(1-\alpha)||\theta||_1 + \alpha||\theta||_2^2\right]$ to the cost, with $\lambda \in \mathbb{R}$, $\alpha \in [0,1]$.

Diagnostics
Bias – The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.
Variance – The variance of a model is the variability of the model prediction for given data points.
Bias/variance tradeoff – The simpler the model, the higher the bias, and the more complex the model, the higher the variance.

Symptoms:
  Underfitting: high training error; training error close to test error; high bias.
  Just right: training error slightly lower than test error.
  Overfitting: low training error; training error much lower than test error; high variance.
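A minimal scikit-learn sketch of k-fold cross-validation applied to the three regularized regressions in the table above (the dataset and regularization strengths are illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

# 5-fold cross-validation error for the three regularized regressions above.
for model in (Lasso(alpha=1.0), Ridge(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(type(model).__name__, -scores.mean())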

Remedies (for the regression, classification and deep learning illustrations of the table above):
  Underfitting: complexify model, add more features, train longer.
  Overfitting: regularize, get more data.

Error analysis – Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.
Ablative analysis – Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.

Python For Data Science Cheat Sheet: Keras
Learn Python for data science interactively at www.DataCamp.com

Keras is a powerful and easy-to-use deep learning library for Theano and TensorFlow that provides a high-level neural networks API to develop and evaluate deep learning models.

A Basic Example
>>> import numpy as np
>>> from keras.models import Sequential
>>> from keras.layers import Dense
>>> data = np.random.random((1000,100))
>>> labels = np.random.randint(2,size=(1000,1))
>>> model = Sequential()
>>> model.add(Dense(32, activation='relu', input_dim=100))
>>> model.add(Dense(1, activation='sigmoid'))
>>> model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
>>> model.fit(data,labels,epochs=10,batch_size=32)
>>> predictions = model.predict(data)

Data (also see NumPy, Pandas & Scikit-Learn)
Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ideally, you split the data in training and test sets, for which you can also resort to the train_test_split module of sklearn.cross_validation.

Keras Data Sets
>>> from keras.datasets import boston_housing, mnist, cifar10, imdb
>>> (x_train,y_train),(x_test,y_test) = mnist.load_data()
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
>>> (x_train3,y_train3),(x_test3,y_test3) = cifar10.load_data()
>>> (x_train4,y_train4),(x_test4,y_test4) = imdb.load_data(num_words=20000)
>>> num_classes = 10

Other
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"),delimiter=",")
>>> X = data[:,0:8]
>>> y = data [:,8]

Preprocessing (also see NumPy & Scikit-Learn)
Sequence Padding
>>> from keras.preprocessing import sequence
>>> x_train4 = sequence.pad_sequences(x_train4,maxlen=80)
>>> x_test4 = sequence.pad_sequences(x_test4,maxlen=80)

One-Hot Encoding
>>> from keras.utils import to_categorical
>>> Y_train = to_categorical(y_train, num_classes)
>>> Y_test = to_categorical(y_test, num_classes)
>>> Y_train3 = to_categorical(y_train3, num_classes)
>>> Y_test3 = to_categorical(y_test3, num_classes)

Train and Test Sets
>>> from sklearn.model_selection import train_test_split
>>> X_train5,X_test5,y_train5,y_test5 = train_test_split(X, y, test_size=0.33, random_state=42)

Standardization/Normalization
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(x_train2)
>>> standardized_X = scaler.transform(x_train2)
>>> standardized_X_test = scaler.transform(x_test2)

Model Architecture
Sequential Model
>>> from keras.models import Sequential
>>> model = Sequential()
>>> model2 = Sequential()
>>> model3 = Sequential()

Multilayer Perceptron (MLP): Binary Classification
>>> from keras.layers import Dense
>>> model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu'))
>>> model.add(Dense(8,kernel_initializer='uniform',activation='relu'))
>>> model.add(Dense(1,kernel_initializer='uniform',activation='sigmoid'))

MLP: Multi-Class Classification
>>> from keras.layers import Dropout
>>> model.add(Dense(512,activation='relu',input_shape=(784,)))
>>> model.add(Dropout(0.2))
>>> model.add(Dense(512,activation='relu'))
>>> model.add(Dropout(0.2))
>>> model.add(Dense(10,activation='softmax'))

MLP: Regression
>>> model.add(Dense(64,activation='relu',input_dim=train_data.shape[1]))
>>> model.add(Dense(1))

Convolutional Neural Network (CNN)
>>> from keras.layers import Activation,Conv2D,MaxPooling2D,Flatten
>>> model2.add(Conv2D(32,(3,3),padding='same',input_shape=x_train.shape[1:]))
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(32,(3,3)))
>>> model2.add(Activation('relu'))
>>> model2.add(MaxPooling2D(pool_size=(2,2)))
>>> model2.add(Dropout(0.25))
>>> model2.add(Conv2D(64,(3,3), padding='same'))
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(64,(3, 3)))
>>> model2.add(Activation('relu'))
>>> model2.add(MaxPooling2D(pool_size=(2,2)))
>>> model2.add(Dropout(0.25))
>>> model2.add(Flatten())
>>> model2.add(Dense(512))
>>> model2.add(Activation('relu'))
>>> model2.add(Dropout(0.5))
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))

Recurrent Neural Network (RNN)
>>> from keras.layers import Embedding,LSTM
>>> model3.add(Embedding(20000,128))
>>> model3.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2))
>>> model3.add(Dense(1,activation='sigmoid'))

Inspect Model
>>> model.output_shape      # Model output shape
>>> model.summary()         # Model summary representation
>>> model.get_config()      # Model configuration
>>> model.get_weights()     # List all weight tensors in the model

Compile Model
MLP: Binary Classification
>>> model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
MLP: Multi-Class Classification
>>> model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
MLP: Regression
>>> model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
Recurrent Neural Network
>>> model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Model Training
>>> model3.fit(x_train4, y_train4, batch_size=32, epochs=15, verbose=1, validation_data=(x_test4,y_test4))

Evaluate Your Model's Performance
>>> score = model3.evaluate(x_test, y_test, batch_size=32)

Prediction
>>> model3.predict(x_test4, batch_size=32)
>>> model3.predict_classes(x_test4,batch_size=32)

Save/Reload Models
>>> from keras.models import load_model
>>> model3.save('model_file.h5')
>>> my_model = load_model('my_model.h5')

Model Fine-tuning
Optimization Parameters
>>> from keras.optimizers import RMSprop
>>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> model2.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

Early Stopping
>>> from keras.callbacks import EarlyStopping
>>> early_stopping_monitor = EarlyStopping(patience=2)
>>> model3.fit(x_train4, y_train4, batch_size=32, epochs=15, validation_data=(x_test4,y_test4), callbacks=[early_stopping_monitor])

DataCamp: Learn Python for Data Science Interactively
CS 229 – Machine Learning https://stanford.edu/~shervine

VIP Cheatsheet: Deep Learning
Afshine Amidi and Shervine Amidi, September 15, 2018

Neural Networks
Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.
Architecture – By noting i the i-th layer of the network and j the j-th hidden unit of the layer, we have:
$z_j^{[i]} = w_j^{[i]T} x + b_j^{[i]}$
where we note w, b, z the weight, bias and output respectively.

Activation function – Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:
  Sigmoid: $g(z) = \frac{1}{1 + e^{-z}}$    Tanh: $g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$    ReLU: $g(z) = \max(0, z)$    Leaky ReLU: $g(z) = \max(\epsilon z, z)$ with $\epsilon \ll 1$

Cross-entropy loss – In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:
$L(z,y) = -\left[y\log(z) + (1-y)\log(1-z)\right]$

Learning rate – The learning rate, often noted $\eta$, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.

Backpropagation – Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using the chain rule and is of the following form:
$\frac{\partial L(z,y)}{\partial w} = \frac{\partial L(z,y)}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w}$
As a result, the weight is updated as follows:
$w \leftarrow w - \eta \frac{\partial L(z,y)}{\partial w}$

Updating weights – In a neural network, weights are updated as follows:
- Step 1: Take a batch of training data.
- Step 2: Perform forward propagation to obtain the corresponding loss.
- Step 3: Backpropagate the loss to get the gradients.
- Step 4: Use the gradients to update the weights of the network.

Dropout – Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1 - p.

Convolutional Neural Networks
Convolutional layer requirement – By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding and S the stride, then the number of neurons N that fit in a given volume is such that:
$N = \frac{W - F + 2P}{S} + 1$

Batch normalization – It is a step of hyperparameters $\gamma, \beta$ that normalizes the batch $\{x_i\}$. By noting $\mu_B, \sigma_B^2$ the mean and variance of the batch we want to correct, it is done as follows:
$x_i \leftarrow \gamma\,\frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$
It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.
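A tiny NumPy sketch of the four weight-update steps above for a single sigmoid unit trained with the cross-entropy loss (the data, learning rate and iteration count are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0]) > 0).astype(float)   # illustrative binary labels

w, b, eta = np.zeros(4), 0.0, 0.5
for _ in range(500):
    z = X @ w + b
    a = 1 / (1 + np.exp(-z))                 # forward pass: sigmoid activation
    # Chain rule for sigmoid + cross-entropy collapses to (a - y):
    grad_z = (a - y) / len(y)
    w -= eta * X.T @ grad_z                  # w <- w - eta * dL/dw
    b -= eta * grad_z.sum()

a_c = np.clip(a, 1e-9, 1 - 1e-9)             # avoid log(0) when reporting the loss
loss = -np.mean(y * np.log(a_c) + (1 - y) * np.log(1 - a_c))
print(loss, ((a > 0.5) == y).mean())         # cross-entropy loss and training accuracy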

Recurrent Neural Networks
Types of gates – Here are the different types of gates that we encounter in a typical recurrent neural network:
  Input gate: write to cell or not?  Forget gate: erase a cell or not?  Output gate: reveal a cell or not?  Gate: how much writing?
LSTM – A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.

Reinforcement Learning and Control
The goal of reinforcement learning is for an agent to learn how to evolve in an environment.
Markov decision processes – A Markov decision process (MDP) is a 5-tuple $(S, A, \{P_{sa}\}, \gamma, R)$ where:
- S is the set of states
- A is the set of actions
- $\{P_{sa}\}$ are the state transition probabilities for $s \in S$ and $a \in A$
- $\gamma \in [0,1[$ is the discount factor
- $R : S \times A \to \mathbb{R}$ or $R : S \to \mathbb{R}$ is the reward function that the algorithm wants to maximize

Policy – A policy $\pi$ is a function $\pi : S \to A$ that maps states to actions.
Remark: we say that we execute a given policy $\pi$ if given a state s we take the action $a = \pi(s)$.

Value function – For a given policy $\pi$ and a given state s, we define the value function $V^\pi$ as follows:
$V^\pi(s) = E\left[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + ...\,|\,s_0 = s, \pi\right]$

Bellman equation – The optimal Bellman equations characterize the value function $V^{\pi^*}$ of the optimal policy $\pi^*$:
$V^{\pi^*}(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s') V^{\pi^*}(s')$
Remark: we note that the optimal policy $\pi^*$ for a given state s is such that:
$\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^*(s')$

Value iteration algorithm – The value iteration algorithm is in two steps:
- We initialize the value: $V_0(s) = 0$
- We iterate the value based on the values before:
  $V_{i+1}(s) = R(s) + \max_{a \in A}\left[\sum_{s' \in S} \gamma P_{sa}(s') V_i(s')\right]$

Maximum likelihood estimate – The maximum likelihood estimates for the state transition probabilities are as follows:
$P_{sa}(s') = \frac{\#\text{times took action } a \text{ in state } s \text{ and got to } s'}{\#\text{times took action } a \text{ in state } s}$

Q-learning – Q-learning is a model-free estimation of Q, which is done as follows:
$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R(s,a,s') + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$
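A minimal tabular Q-learning sketch of the update rule above on a toy chain environment (the environment, step counts and hyperparameters are illustrative assumptions):

import numpy as np

n_states, n_actions = 5, 2            # toy chain: action 0 = left, action 1 = right
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0     # reward only at the rightmost state
    return s2, r

s = 0
for _ in range(5000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s2, r = step(s, a)
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = 0 if s2 == n_states - 1 else s2        # restart the episode at the goal

print(np.argmax(Q, axis=1))   # learned greedy policy: move right everywhere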

CS 230 – Deep Learning https://stanford.edu/~shervine

VIP Cheatsheet: Recurrent Neural Networks
Afshine Amidi and Shervine Amidi, November 26, 2018

Overview
Architecture of a traditional RNN – Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. For each timestep t, the activation $a^{<t>}$ and the output $y^{<t>}$ are expressed as follows:
$a^{<t>} = g_1(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)$ and $y^{<t>} = g_2(W_{ya} a^{<t>} + b_y)$
where $W_{ax}, W_{aa}, W_{ya}, b_a, b_y$ are coefficients that are shared temporally and $g_1, g_2$ are activation functions.

The pros and cons of a typical RNN architecture are summed up in the table below:
  Advantages: possibility of processing input of any length; model size not increasing with size of input; computation takes into account historical information; weights are shared across time.
  Drawbacks: computation being slow; difficulty of accessing information from a long time ago; cannot consider any future input for the current state.

Applications of RNNs – RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:
  One-to-one ($T_x = T_y = 1$): traditional neural network
  One-to-many ($T_x = 1$, $T_y > 1$): music generation
  Many-to-one ($T_x > 1$, $T_y = 1$): sentiment classification
  Many-to-many ($T_x = T_y$): name entity recognition
  Many-to-many ($T_x \neq T_y$): machine translation

Loss function – In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:
$L(\hat{y}, y) = \sum_{t=1}^{T_y} L(\hat{y}^{<t>}, y^{<t>})$

Backpropagation through time – Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:
$\frac{\partial L^{(T)}}{\partial W} = \sum_{t=1}^{T} \left.\frac{\partial L^{(T)}}{\partial W}\right|_{(t)}$

Handling long term dependencies
Commonly used activation functions – The most common activation functions used in RNN modules are described below:
  Sigmoid: $g(z) = \frac{1}{1 + e^{-z}}$    Tanh: $g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$    ReLU: $g(z) = \max(0, z)$

Vanishing/exploding gradient – The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradients that can be exponentially decreasing/increasing with respect to the number of layers.

Gradient clipping – It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.

Types of gates – In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted $\Gamma$ and are equal to:
$\Gamma = \sigma(W x^{<t>} + U a^{<t-1>} + b)$
where W, U, b are coefficients specific to the gate and $\sigma$ is the sigmoid function. The main ones are summed up in the table below:
  Update gate $\Gamma_u$: how much past should matter now? (GRU, LSTM)
  Relevance gate $\Gamma_r$: drop previous information? (GRU, LSTM)
  Forget gate $\Gamma_f$: erase a cell or not? (LSTM)
  Output gate $\Gamma_o$: how much to reveal of a cell? (LSTM)

GRU/LSTM – Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:
  $\tilde{c}^{<t>}$:  GRU: $\tanh(W_c[\Gamma_r \star a^{<t-1>}, x^{<t>}] + b_c)$    LSTM: $\tanh(W_c[\Gamma_r \star a^{<t-1>}, x^{<t>}] + b_c)$
  $c^{<t>}$:          GRU: $\Gamma_u \star \tilde{c}^{<t>} + (1-\Gamma_u) \star c^{<t-1>}$    LSTM: $\Gamma_u \star \tilde{c}^{<t>} + \Gamma_f \star c^{<t-1>}$
  $a^{<t>}$:          GRU: $c^{<t>}$    LSTM: $\Gamma_o \star c^{<t>}$
Remark: the sign $\star$ denotes the element-wise multiplication between two vectors.

Variants of RNNs – The table below sums up the other commonly used RNN architectures:

Bidirectional (BRNN) and Deep (DRNN) RNNs (illustrations omitted).

Learning word representation
In this section, we note V the vocabulary and |V| its size.
Representation techniques – The two main ways of representing words are summed up in the table below:
  1-hot representation: noted $o_w$; naive approach, no similarity information.
  Word embedding: noted $e_w$; takes into account word similarity.

Embedding matrix – For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation $o_w$ to its embedding $e_w$ as follows:
$e_w = E o_w$
Remark: learning the embedding matrix can be done using target/context likelihood models.

Word2vec – Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.

Skip-gram – The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting $\theta_t$ a parameter associated with t, the probability P(t|c) is given by:
$P(t|c) = \frac{\exp(\theta_t^T e_c)}{\sum_{j=1}^{|V|} \exp(\theta_j^T e_c)}$
Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.

Negative sampling – It is a set of binary classifiers using logistic regressions that aim at assessing how a given context and a given target word are likely to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:
$P(y = 1|c,t) = \sigma(\theta_t^T e_c)$
Remark: this method is less computationally expensive than the skip-gram model.

GloVe – The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each $X_{i,j}$ denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:
$J(\theta) = \frac{1}{2}\sum_{i,j=1}^{|V|} f(X_{ij})(\theta_i^T e_j + b_i + b'_j - \log(X_{ij}))^2$
where f is a weighting function such that $X_{i,j} = 0 \Rightarrow f(X_{i,j}) = 0$.
Given the symmetry that e and $\theta$ play in this model, the final word embedding $e_w^{(\text{final})}$ is given by:
$e_w^{(\text{final})} = \frac{e_w + \theta_w}{2}$
Remark: the individual components of the learned word embeddings are not necessarily interpretable.
Stanford University 3 Winter 2019


129
CS 230 – Deep Learning https://stanford.edu/~shervine

Comparing words r Beam search – It is a heuristic search algorithm used in machine translation and speech
recognition to find the likeliest sentence y given an input x.
r Cosine similarity – The cosine similarity between words w1 and w2 is expressed as follows:
w1 · w2 • Step 1: Find top B likely words y <1>
similarity = = cos(◊)
||w1 || ||w2 ||
• Step 2: Compute conditional probabilities y <k> |x,y <1> ,...,y <k≠1>

Remark: ◊ is the angle between words w1 and w2 . • Step 3: Keep top B combinations x,y <1> ,...,y <k>

r t-SNE – t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at re-


ducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly
used to visualize word vectors in the 2D space.

Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.
r Beam width – The beam width B is a parameter for beam search. Large values of B yield
to better result but with slower performance and increased memory. Small values of B lead to
worse results but is less computationally intensive. A standard value for B is around 10.
r Length normalization – In order to improve numerical stability, beam search is usually ap-
plied on the following normalized objective, often called the normalized log-likelihood objective,
defined as:
Ty
1
Language model Objective = log p(y <t> |x,y <1> , ..., y <t≠1> )
Ty–
t=1
ÿ Ë È

r Overview – A language model aims at estimating the probability of a sentence P (y).


r n-gram model – This model is a naive approach aiming at quantifying the probability that Remark: the parameter – can be seen as a softener, and its value is usually between 0.5 and 1.
an expression appears in a corpus by counting its number of appearance in the training data.
y that is bad, one can wonder why
r Perplexity – Language models are commonly assessed using the perplexity metric, also we did not get a good translation y ú by performing the following error analysis:
known as PP, which can be interpreted as the inverse probability of the dataset normalized by
the number of words T . The perplexity is such that the lower, the better and is defined as
r Error analysis – When obtaining a predicted translation ‚

follows: Case y |x) y |x)


T Root cause Beam search faulty RNN faulty
1
PP =
P (y ú |x) > P (‚ P (y ú |x) 6 P (‚

(t) (t) - Try different architecture


yj
A B T1

t=1 j=1 Remedies Increase beam width - Regularize


Ÿ

- Get more data


q|V |

Remark: PP is commonly used in t-SNE.


yj · ‚

r Bleu score – The bilingual evaluation understudy (bleu) score quantifies how good a machine
Machine translation translation is by computing a similarity score based on n-gram precision. It is defined as follows:
r Overview – A machine translation model is similar to a language model except it has an n
encoder network placed before. For this reason, it is sometimes referred as a conditional language 1
bleu score = exp pk
model. The goal is to find a sentence y such that: n
A B

k=1
ÿ

y= arg max P (y <1> ,...,y <Ty > |x)


y <1> ,...,y <Ty > where pn is the bleu score on n-gram only defined as follows:
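A tiny NumPy sketch of the cosine-similarity comparison above, using made-up 4-dimensional embeddings (the vectors are illustrative assumptions):

import numpy as np

def cosine_similarity(w1, w2):
    # similarity = (w1 . w2) / (||w1|| * ||w2||)
    return w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))

emb = {"king":  np.array([0.8, 0.6, 0.1, 0.0]),
       "queen": np.array([0.7, 0.7, 0.2, 0.1]),
       "apple": np.array([0.0, 0.1, 0.9, 0.8])}

print(cosine_similarity(emb["king"], emb["queen"]))   # high: similar directions
print(cosine_similarity(emb["king"], emb["apple"]))   # low: dissimilar directions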

$p_n = \frac{\sum_{\text{n-gram} \in \hat{y}} \text{count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat{y}} \text{count}(\text{n-gram})}$
Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.

Attention
Attention model – This model allows an RNN to pay attention to specific parts of the input that are considered as being important, which improves the performance of the resulting model in practice. By noting $\alpha^{<t,t'>}$ the amount of attention that the output $y^{<t>}$ should pay to the activation $a^{<t'>}$ and $c^{<t>}$ the context at time t, we have:
$c^{<t>} = \sum_{t'} \alpha^{<t,t'>} a^{<t'>}$ with $\sum_{t'} \alpha^{<t,t'>} = 1$
Remark: attention scores are commonly used in image captioning and machine translation.

Attention weight – The amount of attention that the output $y^{<t>}$ should pay to the activation $a^{<t'>}$ is given by $\alpha^{<t,t'>}$ computed as follows:
$\alpha^{<t,t'>} = \frac{\exp(e^{<t,t'>})}{\sum_{t''=1}^{T_x} \exp(e^{<t,t''>})}$
Remark: computation complexity is quadratic with respect to $T_x$.
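A small NumPy sketch of the attention weighting above: scores e are turned into weights with a softmax and used to form the context vector (the scores and activations are illustrative assumptions):

import numpy as np

def attention_context(scores, activations):
    # alpha<t,t'> = softmax over t' of e<t,t'>; c<t> = sum_t' alpha<t,t'> * a<t'>
    e = scores - scores.max()                 # shift for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()
    return alpha, alpha @ activations

scores = np.array([2.0, 0.5, -1.0])           # e<t,t'> for three input positions
activations = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # a<t'> vectors
alpha, c = attention_context(scores, activations)
print(alpha, alpha.sum(), c)                  # weights sum to 1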

CS 230 – Deep Learning https://stanford.edu/~shervine

VIP Cheatsheet: Convolutional Neural Networks
Afshine Amidi and Shervine Amidi, November 26, 2018

Overview
Architecture of a traditional CNN – Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.

Types of layer
Convolutional layer (CONV) – The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.
Remark: the convolution step can be generalized to the 1D and 3D cases as well.

Pooling (POOL) – The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which does some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.
  Max pooling: each pooling operation selects the maximum value of the current view; preserves detected features; most commonly used.
  Average pooling: each pooling operation averages the values of the current view; downsamples the feature map; used in LeNet.

Fully Connected (FC) – The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.

Filter hyperparameters
The convolution layer contains filters for which it is important to know the meaning behind its hyperparameters.
Dimensions of a filter – A filter of size $F \times F$ applied to an input containing C channels is a $F \times F \times C$ volume that performs convolutions on an input of size $I \times I \times C$ and produces an output feature map (also called activation map) of size $O \times O \times 1$.
Remark: the application of K filters of size $F \times F$ results in an output feature map of size $O \times O \times K$.

Stride – For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.

r Zero-padding – Zero-padding denotes the process of adding P zeroes to each side of the
boundaries of the input. This value can either be manually specified or automatically set through
one of the three modes detailed below:

Valid:  P = 0
        - No padding
        - Drops last convolution if dimensions do not match
Same:   P_start = ⌊(S⌈I/S⌉ − I + F − S)/2⌋,   P_end = ⌈(S⌈I/S⌉ − I + F − S)/2⌉
        - Padding such that the feature map has size ⌈I/S⌉
        - Output size is mathematically convenient
        - Also called 'half' padding
Full:   P_start ∈ [[0, F − 1]],   P_end = F − 1
        - Maximum padding such that end convolutions are applied on the limits of the input
        - Filter 'sees' the input end-to-end

Tuning hyperparameters

r Parameter compatibility in convolution layer – By noting I the length of the input
volume size, F the length of the filter, P the amount of zero padding, S the stride, then the
output size O of the feature map along that dimension is given by:

O = (I − F + P_start + P_end) / S + 1

Remark: often times, P_start = P_end = P, in which case we can replace P_start + P_end by 2P
in the formula above.

r Understanding the complexity of the model – In order to assess the complexity of a
model, it is often useful to determine the number of parameters that its architecture will have.
In a given layer of a convolutional neural network, it is done as follows:

CONV:  Input size I × I × C, output size O × O × K, number of parameters (F × F × C + 1) · K
       - One bias parameter per filter
       - In most cases, S < F
       - A common choice for K is 2C
POOL:  Input size I × I × C, output size O × O × C, number of parameters 0
       - Pooling operation done channel-wise
       - In most cases, S = F
FC:    Input size N_in, output size N_out, number of parameters (N_in + 1) × N_out
       - Input is flattened
       - One bias parameter per neuron
       - The number of FC neurons is free of structural constraints

r Receptive field – The receptive field at layer k is the area denoted R_k × R_k of the input
that each pixel of the k-th activation map can 'see'. By calling F_j the filter size of layer j and
S_i the stride value of layer i and with the convention S_0 = 1, the receptive field at layer k can
be computed with the formula:

R_k = 1 + Σ_{j=1}^{k} (F_j − 1) Π_{i=0}^{j−1} S_i

For example, with F_1 = F_2 = 3 and S_1 = S_2 = 1, we get R_2 = 1 + 2 · 1 + 2 · 1 = 5.
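The two formulas above are easy to sanity-check in code. Below is a minimal Python sketch
(the helper names are made up for illustration) that computes the output size O and the
receptive field R_k for a small stack of layers:

```python
def conv_output_size(I, F, S, p_start=0, p_end=0):
    # O = (I - F + P_start + P_end) / S + 1
    return (I - F + p_start + p_end) // S + 1

def receptive_field(filters, strides):
    # R_k = 1 + sum_{j=1..k} (F_j - 1) * prod_{i=0..j-1} S_i, with S_0 = 1
    R, jump = 1, 1                       # jump = product of the strides seen so far (S_0 = 1)
    for F, S in zip(filters, strides):
        R += (F - 1) * jump
        jump *= S
    return R

print(conv_output_size(I=32, F=5, S=1, p_start=2, p_end=2))   # 32 ('same'-style padding)
print(receptive_field(filters=[3, 3], strides=[1, 1]))         # 5, as in the example above
```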

133

Commonly used activation functions

r Rectified Linear Unit – The rectified linear unit layer (ReLU) is an activation function g
that is used on all elements of the volume. It aims at introducing non-linearities to the network.
Its variants are summarized in the table below:

ReLU:        g(z) = max(0, z)
             - Non-linearity complexities biologically interpretable
Leaky ReLU:  g(z) = max(εz, z), with ε ≪ 1
             - Addresses dying ReLU issue for negative values
ELU:         g(z) = max(α(e^z − 1), z), with α ≪ 1
             - Differentiable everywhere

r Softmax – The softmax step can be seen as a generalized logistic function that takes as input
a vector of scores x ∈ R^n and outputs a vector of output probabilities p ∈ R^n through a softmax
function at the end of the architecture. It is defined as follows:

p = (p_1, ..., p_n)    where    p_i = e^{x_i} / Σ_{j=1}^{n} e^{x_j}

Object detection

r Types of models – There are 3 main types of object recognition algorithms, for which the
nature of what is predicted is different. They are described in the table below:

Image classification:
    - Classifies a picture
    - Predicts probability of object
    - Traditional CNN
Classification w. localization:
    - Detects object in a picture
    - Predicts probability of object and where it is located
    - Simplified YOLO, R-CNN
Detection:
    - Detects up to several objects in a picture
    - Predicts probabilities of objects and where they are located
    - YOLO, R-CNN

r Detection – In the context of object detection, different methods are used depending on
whether we just want to locate the object or detect a more complex shape in the image. The
two main ones are summed up in the table below:

Bounding box detection:
    - Detects the part of the image where the object is located
    - Box of center (b_x, b_y), height b_h and width b_w
Landmark detection:
    - Detects a shape or characteristics of an object (e.g. eyes); more granular
    - Reference points (l_1x, l_1y), ..., (l_nx, l_ny)

r Intersection over Union – Intersection over Union, also known as IoU, is a function that
quantifies how correctly positioned a predicted bounding box B_p is over the actual bounding
box B_a. It is defined as:

IoU(B_p, B_a) = (B_p ∩ B_a) / (B_p ∪ B_a)

Remark: we always have IoU ∈ [0,1]. By convention, a predicted bounding box B_p is considered
as being reasonably good if IoU(B_p, B_a) > 0.5.

r Anchor boxes – Anchor boxing is a technique used to predict overlapping bounding boxes.
In practice, the network is allowed to predict more than one box simultaneously, where each box
prediction is constrained to have a given set of geometrical properties. For instance, the first
prediction can potentially be a rectangular box of a given form, while the second will be another
rectangular box of a different geometrical form.

r Non-max suppression – The non-max suppression technique aims at removing duplicate
overlapping bounding boxes of a same object by selecting the most representative ones. After
having removed all boxes having a probability prediction lower than 0.6, the following steps are
repeated while there are boxes remaining:

• Step 1: Pick the box with the largest prediction probability.

• Step 2: Discard any box having an IoU > 0.5 with the previous box.
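A minimal sketch of IoU and the non-max suppression loop just described. Boxes are assumed
here to be (x1, y1, x2, y2) corners with an associated probability score; this representation and
the helper names are illustrative choices, not part of the cheat sheet (thresholds 0.6 and 0.5
follow the text above).

```python
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, p_min=0.6, iou_max=0.5):
    candidates = [(b, s) for b, s in zip(boxes, scores) if s >= p_min]   # drop low-probability boxes
    candidates.sort(key=lambda bs: bs[1], reverse=True)                  # highest probability first
    kept = []
    while candidates:
        best, _ = candidates.pop(0)                                      # Step 1: pick the best box
        kept.append(best)
        candidates = [(b, s) for b, s in candidates if iou(best, b) <= iou_max]  # Step 2: drop overlaps
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(non_max_suppression(boxes, scores=[0.9, 0.8, 0.7]))   # keeps the first and third boxes
```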

134

r YOLO – You Only Look Once (YOLO) is an object detection algorithm that performs the
following steps:

• Step 1: Divide the input image into a G × G grid.

• Step 2: For each grid cell, run a CNN that predicts y of the following form:

  y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, ..., c_p, ...]^T ∈ R^{G×G×k×(5+p)}
  (the block p_c, ..., c_p is repeated k times)

  where p_c is the probability of detecting an object, b_x, b_y, b_h, b_w are the properties of the
  detected bounding box, c_1, ..., c_p is a one-hot representation of which of the p classes were
  detected, and k is the number of anchor boxes.

• Step 3: Run the non-max suppression algorithm to remove any potential duplicate over-
  lapping bounding boxes.

Remark: when p_c = 0, then the network does not detect any object. In that case, the corre-
sponding predictions b_x, ..., c_p have to be ignored.

r R-CNN – Region with Convolutional Neural Networks (R-CNN) is an object detection algo-
rithm that first segments the image to find potential relevant bounding boxes and then runs the
detection algorithm to find the most probable objects in those bounding boxes.

Remark: although the original algorithm is computationally expensive and slow, newer archi-
tectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.

Face verification and recognition

r Types of models – Two main types of model are summed up in the table below:

Face verification:
    - Is this the correct person?
    - One-to-one lookup
Face recognition:
    - Is this one of the K persons in the database?
    - One-to-many lookup

r One Shot Learning – One Shot Learning is a face verification algorithm that uses a limited
training set to learn a similarity function that quantifies how different two given images are. The
similarity function applied to two images is often noted d(image 1, image 2).

r Siamese Network – Siamese Networks aim at learning how to encode images to then quantify
how different two images are. For a given input image x(i), the encoded output is often noted
as f(x(i)).

r Triplet loss – The triplet loss ℓ is a loss function computed on the embedding representation
of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive
example belong to a same class, while the negative example belongs to another one. By calling
α ∈ R+ the margin parameter, this loss is defined as follows:

ℓ(A,P,N) = max(d(A,P) − d(A,N) + α, 0)
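A minimal NumPy sketch of the triplet loss above. Here f_A, f_P, f_N stand for the embeddings
f(x) of anchor, positive and negative images, and d is taken as the squared Euclidean distance,
which is a common but here assumed choice; names are illustrative only.

```python
import numpy as np

def triplet_loss(f_A, f_P, f_N, alpha=0.2):
    d_AP = np.sum((f_A - f_P) ** 2)          # distance anchor-positive
    d_AN = np.sum((f_A - f_N) ** 2)          # distance anchor-negative
    return max(d_AP - d_AN + alpha, 0.0)     # hinge with margin alpha

f_A, f_P, f_N = np.array([0.0, 1.0]), np.array([0.1, 0.9]), np.array([1.0, 0.0])
print(triplet_loss(f_A, f_P, f_N))           # 0.0: the negative is already far enough away
```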

135

Neural style transfer

r Motivation – The goal of neural style transfer is to generate an image G based on a given
content C and a given style S.

r Activation – In a given layer l, the activation is noted a[l] and is of dimensions n_H × n_w × n_c.

r Content cost function – The content cost function J_content(C,G) is used to determine how
the generated image G differs from the original content image C. It is defined as follows:

J_content(C,G) = (1/2) ‖a[l](C) − a[l](G)‖²

r Style matrix – The style matrix G[l] of a given layer l is a Gram matrix where each of its
elements G[l]_{kk'} quantifies how correlated the channels k and k' are. It is defined with respect
to activations a[l] as follows:

G[l]_{kk'} = Σ_{i=1}^{n_H} Σ_{j=1}^{n_w} a[l]_{ijk} a[l]_{ijk'}

Remark: the style matrices for the style image and the generated image are noted G[l](S) and
G[l](G) respectively.

r Style cost function – The style cost function J_style(S,G) is used to determine how the
generated image G differs from the style S. It is defined as follows:

J_style[l](S,G) = 1/(2 n_H n_w n_c)² ‖G[l](S) − G[l](G)‖²_F
               = 1/(2 n_H n_w n_c)² Σ_{k,k'=1}^{n_c} (G[l](S)_{kk'} − G[l](G)_{kk'})²

r Overall cost function – The overall cost function is defined as being a combination of the
content and style cost functions, weighted by parameters α, β, as follows:

J(G) = α J_content(C,G) + β J_style(S,G)

Remark: a higher value of α will make the model care more about the content while a higher
value of β will make it care more about the style.

Architectures using computational tricks

r Generative Adversarial Network – Generative adversarial networks, also known as GANs,
are composed of a generative and a discriminative model, where the generative model aims at
generating the most truthful output that is then fed into the discriminative model, which aims
at differentiating the generated and the true image.

Remark: use cases using variants of GANs include text to image, music generation and
synthesis.

r ResNet – The Residual Network architecture (also called ResNet) uses residual blocks with a
high number of layers meant to decrease the training error. The residual block has the following
characterizing equation:

a[l+2] = g(a[l] + z[l+2])

r Inception Network – This architecture uses inception modules and aims at giving a try
at different convolutions in order to increase its performance. In particular, it uses the 1 × 1
convolution trick to lower the burden of computation.

⋆ ⋆ ⋆

136
CS 230 – Deep Learning https://stanford.edu/~shervine

VIP Cheatsheet: Tips and Tricks

Afshine Amidi and Shervine Amidi

November 26, 2018

Data processing

r Data augmentation – Deep learning models usually need a lot of data to be properly trained.
It is often useful to get more data from the existing ones using data augmentation techniques.
The main ones are summed up in the table below. More precisely, given an input image, here
are the techniques that we can apply:

Original:         - Image without any modification
Flip:             - Flipped with respect to an axis for which the meaning of the image is preserved
Rotation:         - Rotation with a slight angle
                  - Simulates incorrect horizon calibration
Random crop:      - Random focus on one part of the image
                  - Several random crops can be done in a row
Color shift:      - Nuances of RGB are slightly changed
                  - Captures noise that can occur with light exposure
Noise addition:   - Addition of noise
                  - More tolerance to quality variation of inputs
Information loss: - Parts of image ignored
                  - Mimics potential loss of parts of image
Contrast change:  - Luminosity changes
                  - Controls difference in exposition due to time of day

r Batch normalization – It is a step of hyperparameters γ, β that normalizes the batch {x_i}.
By noting μ_B, σ_B² the mean and variance of the batch that we want to correct, it is done as
follows:

x_i ← γ (x_i − μ_B) / √(σ_B² + ε) + β

It is usually done after a fully connected/convolutional layer and before a non-linearity layer and
aims at allowing higher learning rates and reducing the strong dependence on initialization.

Training a neural network

r Epoch – In the context of training a model, epoch is a term used to refer to one iteration
where the model sees the whole training set to update its weights.

r Mini-batch gradient descent – During the training phase, updating weights is usually not
based on the whole training set at once due to computation complexities, nor on one data point
due to noise issues. Instead, the update step is done on mini-batches, where the number of data
points in a batch is a hyperparameter that we can tune.

r Loss function – In order to quantify how a given model performs, the loss function L is
usually used to evaluate to what extent the actual outputs y are correctly predicted by the
model outputs z.

r Cross-entropy loss – In the context of binary classification in neural networks, the cross-
entropy loss L(z,y) is commonly used and is defined as follows:

L(z,y) = −[ y log(z) + (1 − y) log(1 − z) ]

r Backpropagation – Backpropagation is a method to update the weights in the neural network
by taking into account the actual output and the desired output. The derivative with respect
to each weight w is computed using the chain rule. Using this method, each weight is updated
with the rule:

w ← w − α ∂L(z,y)/∂w

r Updating weights – In a neural network, weights are updated as follows:

• Step 1: Take a batch of training data and perform forward propagation to compute the
  loss.

• Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight.

• Step 3: Use the gradients to update the weights of the network.
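As a minimal illustration of the three update steps and the cross-entropy loss above, here is a
NumPy sketch of mini-batch gradient descent on a logistic-regression-style model (the toy data,
batch size and learning rate are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                                  # toy inputs
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)         # toy binary labels
w, alpha, batch_size = np.zeros(3), 0.1, 32

for epoch in range(20):                        # one epoch = one pass over the training set
    for i in range(0, len(X), batch_size):     # mini-batch updates
        xb, yb = X[i:i+batch_size], y[i:i+batch_size]
        z = 1 / (1 + np.exp(-(xb @ w)))        # Step 1: forward propagation
        loss = -np.mean(yb*np.log(z+1e-9) + (1-yb)*np.log(1-z+1e-9))   # cross-entropy L(z, y)
        grad = xb.T @ (z - yb) / len(xb)       # Step 2: gradient of the loss w.r.t. w
        w -= alpha * grad                      # Step 3: w <- w - alpha * dL/dw
print(w)                                       # roughly aligned with the direction [1, -2, 0.5]
```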

137

Parameter tuning

r Xavier initialization – Instead of initializing the weights in a purely random manner, Xavier
initialization makes it possible to have initial weights that take into account characteristics that
are unique to the architecture.

r Transfer learning – Training a deep learning model requires a lot of data and more impor-
tantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets
that took days/weeks to train, and leverage them towards our use case. Depending on how much
data we have at hand, here are the different ways to leverage this:

Small training size:   Freeze all layers, train weights on softmax
Medium training size:  Freeze most layers, train weights on last layers and softmax
Large training size:   Train weights on layers and softmax by initializing weights on the
                       pre-trained ones

r Learning rate – The learning rate, often noted α or sometimes η, indicates at which pace the
weights get updated. It can be fixed or adaptively changed. The current most popular method
is called Adam, which is a method that adapts the learning rate.

r Adaptive learning rates – Letting the learning rate vary when training a model can reduce
the training time and improve the numerical optimal solution. While the Adam optimizer is the
most commonly used technique, others can also be useful. They are summed up in the table
below:

Momentum: - Dampens oscillations
          - Improvement to SGD
          - 2 parameters to tune
          Update: w ← w − α v_dw,   b ← b − α v_db
RMSprop:  - Root Mean Square propagation
          - Speeds up learning algorithm by controlling oscillations
          Update: w ← w − α dw/√s_dw,   b ← b − α db/√s_db
Adam:     - Adaptive Moment estimation
          - Most popular method
          - 4 parameters to tune
          Update: w ← w − α v_dw/(√s_dw + ε),   b ← b − α v_db/(√s_db + ε)

Remark: other methods include Adadelta, Adagrad and SGD.

Regularization

r Dropout – Dropout is a technique used in neural networks to prevent overfitting the training
data by dropping out neurons with probability p > 0. It forces the model to avoid relying too
much on particular sets of features.
Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1 − p.

r Weight regularization – In order to make sure that the weights are not too large and that
the model is not overfitting the training set, regularization techniques are usually performed on
the model weights. The main ones are summed up in the table below:

LASSO:       ... + λ‖θ‖₁,   λ ∈ R
             - Shrinks coefficients to 0
             - Good for variable selection
Ridge:       ... + λ‖θ‖₂²,   λ ∈ R
             - Makes coefficients smaller
Elastic Net: ... + λ[(1 − α)‖θ‖₁ + α‖θ‖₂²],   λ ∈ R, α ∈ [0,1]
             - Tradeoff between variable selection and small coefficients
138

r Early stopping – This regularization technique stops the training process as soon as the
validation loss reaches a plateau or starts to increase.

Good practices
r Overfitting small batch – When debugging a model, it is often useful to make quick tests
to see if there is any major issue with the architecture of the model itself. In particular, in order
to make sure that the model can be properly trained, a mini-batch is passed inside the network
to see if it can overfit on it. If it cannot, it means that the model is either too complex or not
complex enough to even overfit on a small batch, let alone a normal-sized training set.
r Gradient checking – Gradient checking is a method used during the implementation of
the backward pass of a neural network. It compares the value of the analytical gradient to the
numerical gradient at given points and plays the role of a sanity-check for correctness.

Numerical gradient:
    Formula: df/dx(x) ≈ (f(x + h) − f(x − h)) / (2h)
    - Expensive; loss has to be computed two times per dimension
    - Used to verify the correctness of the analytical implementation
    - Trade-off in choosing h not too small (numerical instability)
      nor too large (poor gradient approximation)
Analytical gradient:
    Formula: df/dx(x) = f'(x)
    - 'Exact' result
    - Direct computation
    - Used in the final implementation
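A minimal sketch of the check itself, comparing the centered-difference estimate to a known
analytical derivative (the function and the value of h are illustrative):

```python
def numerical_gradient(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)    # centered difference

f = lambda x: x ** 3                          # analytical derivative: 3 x^2
x = 2.0
num = numerical_gradient(f, x)
ana = 3 * x ** 2
print(num, ana, abs(num - ana) < 1e-6)        # the two gradients should agree closely
```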

⋆ ⋆ ⋆



139
140
DASK FOR PARALLEL COMPUTING CHEAT SHEET
See full Dask documentation at: http://dask.pydata.org/
These instructions use the conda environment manager. Get yours at http://bit.ly/getconda

DASK QUICK INSTALL


Install Dask with conda conda install dask
Install Dask with pip pip install dask[complete]
DASK COLLECTIONS EASY TO USE BIG DATA COLLECTIONS
DASK DATAFRAMES PARALLEL PANDAS DATAFRAMES FOR LARGE DATA
Import import dask.dataframe as dd
Read CSV data df = dd.read_csv('my-data.*.csv')
Read Parquet data df = dd.read_parquet('my-data.parquet')
Filter and manipulate data with Pandas syntax df['z'] = df.x + df.y
Standard groupby aggregations, joins, etc. result = df.groupby(df.z).y.mean()
Compute result as a Pandas dataframe out = result.compute()
Or store to CSV, Parquet, or other formats result.to_parquet('my-output.parquet')
EXAMPLE df = dd.read_csv('filenames.*.csv')
df.groupby(df.timestamp.day)\
.value.mean().compute()
DASK ARRAYS PARALLEL NUMPY ARRAYS FOR LARGE DATA
Import import dask.array as da
Create from any array-like object import h5py
dataset = h5py.File('my-data.hdf5')['/group/dataset']
Including HDF5, NetCDF, or other x = da.from_array(dataset, chunks=(1000, 1000))
on-disk formats.
Alternatively generate an array from a random da.random.uniform(size=(10000, 10000), chunks=(100, 100))
distribution.
Perform operations with NumPy syntax y = x.dot(x.T - 1) - x.mean(axis=0)
Compute result as a NumPy array result = y.compute()
Or store to HDF5, NetCDF or other out = f.create_dataset(...)
on-disk format x.store(out)
EXAMPLE with h5py.File('my-data.hdf5') as f:
x = da.from_array(f['/path'], chunks=(1000, 1000))
x -= x.mean(axis=0)
out = f.create_dataset(...)
x.store(out)
DASK BAGS PARALLEL LISTS FOR UNSTRUCTURED DATA
Import import dask.bag as db
Create Dask Bag from a sequence b = db.from_sequence(seq, npartitions)
Or read from text formats b = db.read_text('my-data.*.json')
Map and filter results import json
records = b.map(json.loads)
.filter(lambda d: d["name"] == "Alice")
Compute aggregations like mean, count, sum records.pluck('key-name').mean().compute()
Or store results back to text formats records.to_textfiles('output.*.json')
EXAMPLE db.read_text('s3://bucket/my-data.*.json')
.map(json.loads)
.filter(lambda d: d["name"] == "Alice")
.to_textfiles('s3://bucket/output.*.json')

141
DASK COLLECTIONS (CONTINUED)
ADVANCED
Read from distributed file systems or df = dd.read_parquet('s3://bucket/myfile.parquet')
cloud storage
Prepend prefixes like hdfs://, s3://, b = db.read_text('hdfs:///path/to/my-data.*.json')
or gcs:// to paths
Persist lazy computations in memory df = df.persist()
Compute multiple outputs at once dask.compute(x.min(), x.max())
CUSTOM COMPUTATIONS FOR CUSTOM CODE AND COMPLEX ALGORITHMS
DASK DELAYED LAZY PARALLELISM FOR CUSTOM CODE
Import import dask
Wrap custom functions with the @dask.delayed
@dask.delayed annotation def load(filename):
...
Delayed functions operate lazily, @dask.delayed
producing a task graph rather than def process(data):
executing immediately ...
Passing delayed results to other load = dask.delayed(load)
delayed functions creates process = dask.delayed(process)
dependencies between tasks
Call functions in normal code data = [load(fn) for fn in filenames]
results = [process(d) for d in data]
Compute results to execute in parallel dask.compute(results)
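A minimal, self-contained dask.delayed example tying the rows above together (the stand-in
functions and the numbers are illustrative only; real code would load and process actual files):

```python
import dask

@dask.delayed
def load(n):
    return list(range(n))          # stand-in for reading a file

@dask.delayed
def process(data):
    return sum(data)               # stand-in for real processing

tasks = [process(load(n)) for n in [10, 20, 30]]   # builds a lazy task graph, nothing runs yet
totals = dask.compute(*tasks)                      # executes the graph in parallel
print(totals)                                      # (45, 190, 435)
```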
CONCURRENT.FUTURES ASYNCHRONOUS REAL-TIME PARALLELISM
Import from dask.distributed import Client
Start local Dask Client client = Client()
Submit individual task asynchronously future = client.submit(func, *args, **kwargs)
Block and gather individual result result = future.result()
Process results as they arrive for future in as_completed(futures):
...
EXAMPLE L = [client.submit(read, fn) for fn in filenames]
L = [client.submit(process, future) for future in L]
future = client.submit(sum, L)
result = future.result()
SET UP CLUSTER HOW TO LAUNCH ON A CLUSTER
MANUALLY
Start scheduler on one machine $ dask-scheduler
Scheduler started at SCHEDULER_ADDRESS:8786
Start workers on other machines host1$ dask-worker SCHEDULER_ADDRESS:8786
Provide address of the running scheduler host2$ dask-worker SCHEDULER_ADDRESS:8786
Start Client from Python process from dask.distributed import Client
client = Client('SCHEDULER_ADDRESS:8786')
ON A SINGLE MACHINE
Call Client() with no arguments for easy client = Client()
setup on a single host
CLOUD DEPLOYMENT
See dask-kubernetes project for Google Cloud pip install dask-kubernetes
See dask-ec2 project for Amazon EC2 pip install dask-ec2

MORE RESOURCES
User Documentation dask.pydata.org
Technical documentation for distributed scheduler distributed.readthedocs.org
Report a bug github.com/dask/dask/issues

anaconda.com · info@anaconda.com · 512-776-1066


8/20/2017
Python For Data Science Cheat Sheet Duplicate Values GroupBy
>>> df = df.dropDuplicates() >>> df.groupBy("age")\ Group by age, count the members
PySpark - SQL Basics .count() \ in the groups
.show()
Learn Python for data science Interactively at www.DataCamp.com Queries
>>> from pyspark.sql import functions as F Filter
Select
>>> df.select("firstName").show() Show all entries in firstName column >>> df.filter(df["age"]>24).show() Filter entries of age, only keep those
>>> df.select("firstName","lastName") \ records of which the values are >24
PySpark & Spark SQL .show()
>>> df.select("firstName", Show all entries in firstName, age
Spark SQL is Apache Spark's module for "age", and type
explode("phoneNumber") \ Sort
working with structured data. .alias("contactInfo")) \
.select("contactInfo.type", >>> peopledf.sort(peopledf.age.desc()).collect()
"firstName", >>> df.sort("age", ascending=False).collect()
Initializing SparkSession "age") \ >>> df.orderBy(["age","city"],ascending=[0,1])\
A SparkSession can be used to create DataFrames, register DataFrames as tables,
execute SQL over tables, cache tables, and read parquet files. >>> df.select(df["firstName"],df["age"]+ 1) Show all entries in firstName and age,
.show() add 1 to the entries of age
>>> from pyspark.sql import SparkSession >>> df.select(df['age'] > 24).show() Show all entries where age >24
>>> spark = SparkSession \ When
Missing & Replacing Values
.builder \ >>> df.select("firstName", Show firstName and 0 or 1 depending
.appName("Python Spark SQL basic example") \ >>> df.na.fill(50).show() Replace null values
F.when(df.age > 30, 1) \ on age >30 >>> df.na.drop().show() Return new df omitting rows with null values
.config("spark.some.config.option", "some-value") \ .otherwise(0)) \
.getOrCreate() >>> df.na \ Return new df replacing one value with
.show() .replace(10, 20) \ another
>>> df[df.firstName.isin("Jane","Boris")] Show firstName if in the given options .show()
.collect()
Creating DataFrames Like
>>> df.select("firstName", Show firstName, and lastName is Repartitioning
From RDDs df.lastName.like("Smith")) \ TRUE if lastName is like Smith
.show()
>>> from pyspark.sql.types import * Startswith - Endswith >>> df.repartition(10)\ df with 10 partitions
>>> df.select("firstName", Show firstName, and TRUE if .rdd \
Infer Schema .getNumPartitions()
>>> sc = spark.sparkContext df.lastName \ lastName starts with Sm
.startswith("Sm")) \ >>> df.coalesce(1).rdd.getNumPartitions() df with 1 partition
>>> lines = sc.textFile("people.txt")
.show()
>>> parts = lines.map(lambda l: l.split(",")) >>> df.select(df.lastName.endswith("th")) \ Show last names ending in th
>>> people = parts.map(lambda p: Row(name=p[0],age=int(p[1]))) .show() Running SQL Queries Programmatically
>>> peopledf = spark.createDataFrame(people) Substring
Specify Schema >>> df.select(df.firstName.substr(1, 3) \ Return substrings of firstName Registering DataFrames as Views
>>> people = parts.map(lambda p: Row(name=p[0], .alias("name")) \
age=int(p[1].strip()))) .collect() >>> peopledf.createGlobalTempView("people")
>>> schemaString = "name age" Between >>> df.createTempView("customer")
>>> fields = [StructField(field_name, StringType(), True) for >>> df.select(df.age.between(22, 24)) \ Show age: values are TRUE if between >>> df.createOrReplaceTempView("customer")
field_name in schemaString.split()] .show() 22 and 24
>>> schema = StructType(fields) Query Views
>>> spark.createDataFrame(people, schema).show()
+--------+---+ Add, Update & Remove Columns
| name|age| >>> df5 = spark.sql("SELECT * FROM customer").show()
+--------+---+ >>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")\
| Mine| 28| Adding Columns
| Filip| 29| .show()
|Jonathan| 30|
+--------+---+ >>> df = df.withColumn('city',df.address.city) \
.withColumn('postalCode',df.address.postalCode) \
.withColumn('state',df.address.state) \ Output
From Spark Data Sources .withColumn('streetAddress',df.address.streetAddress) \
.withColumn('telePhoneNumber', Data Structures
JSON explode(df.phoneNumber.number)) \
>>> df = spark.read.json("customer.json") .withColumn('telePhoneType',
>>> df.show() >>> rdd1 = df.rdd Convert df into an RDD
+--------------------+---+---------+--------+--------------------+ explode(df.phoneNumber.type)) >>> df.toJSON().first() Convert df into a RDD of string
| address|age|firstName |lastName| phoneNumber|
+--------------------+---+---------+--------+--------------------+ >>> df.toPandas() Return the contents of df as Pandas
|[New York,10021,N...| 25| John| Smith|[[212 555-1234,ho...| Updating Columns DataFrame
|[New York,10021,N...| 21| Jane| Doe|[[322 888-1234,ho...|
+--------------------+---+---------+--------+--------------------+
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber') Write & Save to Files
>>> df2 = spark.read.load("people.json", format="json")
Parquet files Removing Columns >>> df.select("firstName", "city")\
>>> df3 = spark.read.load("users.parquet") .write \
TXT files >>> df = df.drop("address", "phoneNumber") .save("nameAndCity.parquet")
>>> df4 = spark.read.text("people.txt") >>> df = df.drop(df.address).drop(df.phoneNumber) >>> df.select("firstName", "age") \
.write \
.save("namesAndAges.json",format="json")
Inspect Data
>>> df.dtypes Return df column names and data types >>> df.describe().show() Compute summary statistics Stopping SparkSession
>>> df.show() Display the content of df >>> df.columns Return the columns of df
>>> df.count() Count the number of rows in df >>> spark.stop()
>>> df.head() Return first n rows
>>> df.first() Return first row >>> df.distinct().count() Count the number of distinct rows in df
>>> df.printSchema() Print the schema of df
142

>>> df.take(2) Return the first n rows
>>> df.schema Return the schema of df
>>> df.explain() Print the (logical and physical) plans

DataCamp – Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Retrieving RDD Information Reshaping Data
Basic Information Reducing
PySpark - RDD Basics >>> rdd.reduceByKey(lambda x,y : x+y) Merge the rdd values for
>>> rdd.getNumPartitions() List the number of partitions .collect() each key
Learn Python for data science Interactively at www.DataCamp.com >>> rdd.count() Count RDD instances [('a',9),('b',2)]
3 >>> rdd.reduce(lambda a, b: a + b) Merge the rdd values
>>> rdd.countByKey() Count RDD instances by key ('a',7,'a',2,'b',2)
defaultdict(<type 'int'>,{'a':2,'b':1}) Grouping by
>>> rdd.countByValue() Count RDD instances by value >>> rdd3.groupBy(lambda x: x % 2) Return RDD of grouped values
Spark defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1}) .mapValues(list)
>>> rdd.collectAsMap() Return (key,value) pairs as a .collect()
PySpark is the Spark Python API that exposes {'a': 2,'b': 2} dictionary >>> rdd.groupByKey() Group rdd by key
>>> rdd3.sum() Sum of RDD elements .mapValues(list)
the Spark programming model to Python. 4950 .collect()
>>> sc.parallelize([]).isEmpty() Check whether RDD is empty [('a',[7,2]),('b',[2])]
True
Initializing Spark Aggregating
Summary >>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
SparkContext >>> rdd3.max() Maximum value of RDD elements >>> rdd3.aggregate((0,0),seqOp,combOp) Aggregate RDD elements of each
99 (4950,100) partition and then the results
>>> from pyspark import SparkContext >>> rdd3.min() Minimum value of RDD elements
>>> sc = SparkContext(master = 'local[2]') >>> rdd.aggregateByKey((0,0),seqop,combop) Aggregate values of each RDD key
0
>>> rdd3.mean() Mean value of RDD elements .collect()
Inspect SparkContext 49.5 [('a',(9,2)), ('b',(2,1))]
>>> rdd3.stdev() Standard deviation of RDD elements >>> rdd3.fold(0,add) Aggregate the elements of each
>>> sc.version Retrieve SparkContext version 28.866070047722118 4950 partition, and then the results
>>> sc.pythonVer Retrieve Python version >>> rdd3.variance() Compute variance of RDD elements >>> rdd.foldByKey(0, add) Merge the values for each key
>>> sc.master Master URL to connect to 833.25 .collect()
>>> str(sc.sparkHome) Path where Spark is installed on worker nodes >>> rdd3.histogram(3) Compute histogram by bins [('a',9),('b',2)]
>>> str(sc.sparkUser()) Retrieve name of the Spark User running ([0,33,66,99],[33,33,34])
>>> rdd3.stats() Summary statistics (count, mean, stdev, max & >>> rdd3.keyBy(lambda x: x+x) Create tuples of RDD elements by
SparkContext
min) .collect() applying a function
>>> sc.appName Return application name
>>> sc.applicationId Retrieve application ID
>>> sc.defaultParallelism Return default level of parallelism Mathematical Operations
>>> sc.defaultMinPartitions Default minimum number of partitions for Applying Functions
RDDs >>> rdd.map(lambda x: x+(x[1],x[0])) Apply a function to each RDD element >>> rdd.subtract(rdd2) Return each rdd value not contained
.collect() .collect() in rdd2
Configuration [('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')] [('b',2),('a',7)]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0])) Apply a function to each RDD element >>> rdd2.subtractByKey(rdd) Return each (key,value) pair of rdd2
>>> from pyspark import SparkConf, SparkContext and flatten the result .collect()
>>> conf = (SparkConf() >>> rdd5.collect() [('d', 1)]
.setMaster("local") ['a',7,7,'a','a',2,2,'a','b',2,2,'b'] >>> rdd.cartesian(rdd2).collect() Return the Cartesian product of rdd
.setAppName("My app") >>> rdd4.flatMapValues(lambda x: x) Apply a flatMap function to each (key,value) and rdd2
.set("spark.executor.memory", "1g")) .collect() pair of rdd4 without changing the keys
>>> sc = SparkContext(conf = conf) [('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]
Sort
Using The Shell Selecting Data >>> rdd2.sortBy(lambda x: x[1]) Sort RDD by given function
.collect()
In the PySpark shell, a special interpreter-aware SparkContext is already Getting
created in the variable called sc. >>> rdd.collect() Return a list with all RDD elements >>> rdd2.sortByKey() Sort (key, value) RDD by key
[('a', 7), ('a', 2), ('b', 2)] .collect()
$ ./bin/spark-shell --master local[2] >>> rdd.take(2) Take first 2 RDD elements [('a',2),('b',1),('d',1)]
$ ./bin/pyspark --master local[4] --py-files code.py [('a', 7), ('a', 2)]
Set which master the context connects to with the --master argument, and >>> rdd.first() Take first RDD element
('a', 7) Repartitioning
add Python .zip, .egg or .py files to the runtime path by passing a >>> rdd.top(2) Take top 2 RDD elements
[('b', 2), ('a', 7)] >>> rdd.repartition(4) New RDD with 4 partitions
comma-separated list to --py-files. >>> rdd.coalesce(1) Decrease the number of partitions in the RDD to 1
Sampling
Loading Data >>> rdd3.sample(False, 0.15, 81).collect() Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97] Saving
Parallelized Collections Filtering >>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.filter(lambda x: "a" in x) Filter the RDD >>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)]) .collect() 'org.apache.hadoop.mapred.TextOutputFormat')
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)]) [('a',7),('a',2)]
>>> rdd3 = sc.parallelize(range(100)) >>> rdd5.distinct().collect() Return distinct RDD values
>>> rdd4 = sc.parallelize([("a",["x","y","z"]), ['a',2,'b',7] Stopping SparkContext
("b",["p", "r"])]) >>> rdd.keys().collect() Return (key,value) RDD's keys
['a', 'a', 'b'] >>> sc.stop()
External Data
Read either one text file from HDFS, a local file system or any Hadoop-supported file system
URI with textFile(), or read in a directory of text files with wholeTextFiles().
>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

Iterating
>>> def g(x): print(x)
>>> rdd.foreach(g) Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

Execution
$ ./bin/spark-submit examples/src/main/python/pi.py

143

DataCamp – Learn Python for Data Science Interactively
NLP Cheat Sheet

Bag of Words Methods

Information Retrieval

Cosine similarity metric
Define a similarity metric between topic vectors. It is the cosine of the angle between the term
vectors:

sim(A,B) = Σ_i a_i b_i / ( √(Σ_i a_i²) · √(Σ_i b_i²) )

Unusual words determine the topic much more than common words. Weight each term frequency
tf_i by its inverse document frequency idf:

idf_i = log(N / n_i)

where N = size of collection and n_i = number of documents containing term i.

w_i = tf_i × idf_i

sim(A,B) = Σ_{w ∈ A,B} tf_{w,A} tf_{w,B} idf_w² / ( √(Σ_{w1 ∈ A} (tf_{w1,A} idf_{w1})²) · √(Σ_{w2 ∈ B} (tf_{w2,B} idf_{w2})²) )

Sentiment analysis (opinion mining)
Opinion mining is the task of judging whether a document expresses a positive or a negative
opinion (or no opinion) regarding a particular object or topic.

The simplest strategy uses a bag-of-words model. We create lists of 'positive' and 'negative'
words and judge a document based on whether it has a preponderance of positive or negative
words (and judge it neutral or 'objective' if it has few words of either category).

Creating such lists is not easy; some of the words are likely to be quite different for different
kinds of topics. As an alternative, we will take a collection of documents on some topic and rate
them (by hand) as positive or negative. We will then use this collection to train a classifier model:

c = argmax_{c ∈ C} P(c) Π_{i ∈ positions} P(w_i | c)

P(c) = count(docs labeled c) / N

P(w_i | c) = count(docs labeled c containing w_i) / count(docs labeled c)

Smoothing: P(w_i | c) = (count(w_i, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)

The bag-of-words strategy with the NB model does quite well for short reviews, but fails for:
• ambiguous terms
• negation
• comparative reviews
• revealing aspects of an opinion

Context-free grammar
A context-free grammar consists of:
• a set of non-terminal symbols A, B, C, ... ∈ N
• a set of terminal symbols a, b, c, ... ∈ T
• a start symbol S ∈ N
• a set of productions P of the form N → (N ∪ T)*
CFG can be used to generate sentences, recognize sentences or parse sentences.

Parsing

The CKY algorithm
The grammar has to be in Chomsky normal form (CNF). Every production must have the form
A → BC or A → a.
• Use a parse table PT[i, j]
  – each entry is a list of grammar symbols
  – PT[i, j] includes X if the words from i up to (but not including) j can be analyzed as an X
• Two steps
  – if there is a rule A → a and word i is an a, put an A in PT[i, i+1]
  – if there is a rule A → BC, PT[i, j] includes B and PT[j, k] includes C, add A to PT[i, k]
Time complexity = O(n³)
Space complexity = O(n²)

POS tagging

Accuracy
For part-of-speech tagging, accuracy is a simple and reasonable metric:
accuracy = tokens with correct tag / total tokens

Precision and recall
Instead of counting the tags themselves, we count the names defined by these tags:
key = number of names in key; response = number of names in system response; correct =
number of names in the response which exactly match (in type and extent) a name in the key.
precision = correct / response
recall = correct / key
Also:
precision = True Positives / (True Positives + False Positives)
recall = True Positives / (True Positives + False Negatives)
F-measure: the harmonic mean of recall and precision.
F = 2 × (precision × recall) / (precision + recall)

HMMs and the Viterbi decoder
HMM is a Weighted Finite-state Automaton (WFSA)
• Each transition arc is associated with a probability
• The sum of all arcs outgoing from a single node is 1
The Viterbi decoder's time complexity is O(s²n) and its space complexity is O(sn) for s states
and n words.

viterbi[s, t] = max_{s'}( viterbi[s', t−1] × p(s, s') × P(token[t] | s) )

For each s, t: record which s', t−1 contributed the maximum.

Chunkers and name taggers

BIO (IOB) tags
I → inside a baseNP
O → outside a baseNP
B → start of a baseNP which follows another baseNP

Maximum entropy
MaxEnt is typically used for a multi-class classifier. We are given a set of training data, where
each datum is labeled with a set of features and a class (tag). Each feature-class pair constitutes
an indicator function. We train a classifier using this data, computing the weights. We can then
classify new data by selecting the class (tag) which maximizes p(h, t).

MaxEnt models can be trained by providing a set of K features in the form of binary-valued
indicator functions f_i(h, t). We will use a log-linear model, where

p(h, t) = (1/Z) Π_{i=1}^{K} α_i^{f_i(h,t)}

where α_i is the weight for feature i, and Z is a normalizing constant.

Maximum Entropy Markov Models (MEMM)
Maximum entropy modeling can be combined with a Markov model, so that, for each state, the
probability of a transition to that state is computed by a MaxEnt model, based on the prior
state and arbitrary features of the input sequence. The result is a Maximum Entropy Markov
Model (MEMM). Decoding (selecting the best tag sequence) can be done deterministically
left-to-right or (better) using Viterbi decoding like an HMM.

HMM vs MEMM
HMM is a generative model, while MEMM is a discriminative model. MEMM can incorporate
long distance features.

Feature engineering
When using a package such as MaxEnt, the computational linguist's job becomes one of feature
engineering – identifying the features which will be most predictive of the tags we are trying to
assign. Simple features are the current token (or the previous token, or the next token) having
a particular value, or being on a particular list (such as a list of common titles or common first
names).
144
Lexical semantics and word sense disambiguation

Terminology
• multiple senses of a word
• polysemy (and homonymy for totally unrelated senses ("bank"))
• metonymy for certain types of regular, productive polysemy ("the White House", "Washington")
• zeugma (conjunction combining distinct senses) as test for polysemy ("serve")
• synonymy: when two words mean (more-or-less) the same thing
• hyponymy: X is the hyponym of Y if X denotes a more specific subclass of Y (X is the
  hyponym, Y is the hypernym)
• synsets: synonym sets

Word Sense Disambiguation
The process of identifying the sense of a word in context.
Simple supervised WSD algorithm: naive Bayes:
selected sense s' = argmax_{s ∈ senses} P(s | F) = argmax_s P(s) Π_i P(f_i | s)
where F = {f_1, f_2, ...} is the set of context features.
Training a naive Bayes classifier consists of estimating each of these probabilities.
P(s_i) = count(s_i, w_j) / count(w_j): the number of times the sense s_i occurs for w_j, divided
by the total count of the target word w_j.
P(f_j | s) = count(f_j, s) / count(s)

Word Embeddings
Word embeddings are low-dimensional (50-200 element) real vectors that encode information on
the contexts in which a word appears. Each component of the vector combines information on
multiple contexts. The embeddings can be produced by dimensionality reduction on the context
matrix.

Information Content
P(c): for each concept (synset), P(c) = probability that a word in a corpus is an instance of the
concept (matches the synset c or one of its hyponyms).
Information content of a concept: IC(c) = −log P(c)

Probabilistic CFG
Associate a probability with each production
• sum of probabilities of productions expanding a symbol = 1
• use MLE
P(A → α) = count(A → α) / Σ count(A → ·)
Probability of a parse = product of the probabilities of its productions

Learning
Supervised learning: All training data is labeled
Semi-Supervised learning:
• Part of training data is labeled
• Make use of redundancies to learn labels of additional data, then train model
• Co-training
• Reduces amount of data which must be hand-labeled to achieve a given level of performance
Active learning:
• Start with partially labeled data
• System selects additional 'informative' examples for user to label
Unsupervised learning: driven entirely by an unlabeled corpus
Distant supervision: given a large amount of tabular information, we use a heuristic procedure
to (partially) label a text corpus. Distant supervision works roughly like a half iteration of
bootstrapping, but with a large set of seeds. Given a data base with a large set of instances of a
relation and a large raw text corpus, we use the instances to label the corpus and then train a
classifier over the corpus.
Confidence of pattern P: Conf(P) = P.positive / (P.positive + P.negative)
where P.positive = number of positive matches to pattern P and P.negative = number of
negative matches to pattern P.

Reference resolution
Identify all phrases which refer to the same real-world entity.

Terminology
Referent: real world object referred to
Referring expression [mention]: a phrase referring to that object
Discourse entities: the set of objects referred to by a text
Coreference: two expressions referring to the same thing
Antecedent: prior expression
Anaphor: following expression
Types of referring expressions: definite pronouns (he, she, it), indefinite pronouns (one), definite
NPs (the car), indefinite NPs (a car), names.
Wikification: entity linking, i.e. map each document-level entity to an entry in a standard
database such as Wikipedia.

Complications
• inferrables: sometimes the relation between anaphor and antecedent is not one of identity
• zero Anaphora: many languages allow subject omission, and some allow omission of other
  arguments
• Cataphora: pronoun referring to a following mention
• Bridging Anaphora: reference to a related object
• Non-NP Anaphora: pronouns can also refer to events or propositions

Hobbs search
In selecting which sentences to search, Hobbs starts with the sentence containing the anaphor
and then selects prior sentences from right to left, examining first the immediately preceding
sentence. However, within an individual sentence he searches from left to right; this is the way
he captures the preference for antecedents in subject position. The resulting search order defines
a Hobbs distance: if in searching for antecedents of X the d-th candidate we consider is Y, we
say that the Hobbs distance from X to Y is d.

Resolving pronoun reference
Constraints:
• animacy
• gender
• number
Preferences:
• recent: at most 3 sentences back
• salient: mentioned several times recently
• subjects
Selectional preferences: prefer the antecedent that is more likely to occur in the context of the
pronoun.
Ge, Hale, and Charniak formula:
P = P1(correct antecedent is at Hobbs distance d) ×
    P2(pronoun | head of antecedent) ×
    P3(antecedent | mention count) ×
    P4(head of antecedent | context of pronoun)

Models
• mention-pair model: train a binary classifier
  – scan mentions in text order
  – link each mention to the closest antecedent classified as coreferential
  – link each mention to the antecedent most confidently labeled
  – cluster mentions
• entity-mention model: uses information from all mentions gathered so far

Question Answering

Mean Reciprocal Rank
Typically for evaluations, QA systems generate a ranked set of answers to each query and are
scored based on the rank of the first correct answer: mean reciprocal rank.

MRR = (1/N) Σ_i 1/rank_i

Each question is scored according to the reciprocal of the rank of the first correct answer. The
score of a system is then the average of the score for each question in the set.
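A minimal Python sketch of the MRR formula above (the example ranks are made up):

```python
def mean_reciprocal_rank(ranks):
    """ranks[i] is the rank of the first correct answer for question i (None if none is correct)."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)

print(mean_reciprocal_rank([1, 3, 2]))   # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```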
145
Machine translation

The Noisy Channel Model
Suppose we are translating French/Foreign sentences F to English E. We want to determine the
English sentence most likely to have generated it:

E' = argmax_E P(E|F)
   = argmax_E P(F|E) P(E) / P(F)
   = argmax_E P(F|E) P(E)

It combines a translation model P(F|E) and a language model P(E). P(F|E) is learned from a
parallel corpus; P(E) can be learned from a large monolingual corpus.

The Lexical Translation Model
IBM Model 1
Translation is based on alignment: mapping a target word at position i to a source word at
position j. Assumptions:
• Each target word is generated by exactly one source word
• A target word can be generated by a NULL word; multiple target words can be generated by
  the same source word
• All alignments are equally likely (a very crude model)

The translation model generates a sentence of F (given E) in 3 steps:
• pick a length for F
• pick an alignment of F (length J) and E (length I+1)
• pick the i-th word of F based on the English word e_j with which it aligns, using the
  distribution t(f_i | e_j)

P(A|E) = 1 / (I+1)^J

P(F|E,A) = Π_j t(f_j | e_{a_j})

P(F,A|E) = P(F|E,A) × P(A|E) = ε / (I+1)^J · Π_j t(f_j | e_{a_j})

P(F|E) = Σ_A P(F,A|E)

P(A|F,E) = P(F,A|E) / Σ_A P(F,A|E)

The heart of the translation model is the word translation probabilities t(f_j | e_i).

EM training
Typically we are given sentence-aligned but not word-aligned corpora, so we cannot compute
these probabilities by direct counting and instead use an iterative procedure called "EM".
Algorithm:
• EM begins by assuming all word translations t(f_i | e_j) are equally likely
• compute probabilities of alignments given the word translation probabilities (E-step), using
  the formulas for P(F,A|E) and P(A|F,E) above
• compute counts of aligned word pairs, weighted by alignment probabilities (E-step)
• recompute MLE word translation probabilities from these counts (M-step)
• repeat

Copyright © 2018 Antonio Mallia
https://www.antoniomallia.it
146
147

Text Analysis with NLTK Cheatsheet

>>> import nltk


>>> nltk.download() This step will bring up a window in which you can download ‘All Corpora’
>>> from nltk.book import *

Basics
tokens >>> text1[0:100] - first 101 tokens
>>> text2[5] - fifth token
concordance >>> text3.concordance(‘begat’) - basic keyword-in-context
>>> text1.concordance(‘sea’, lines=100) - show other than default 25 lines
>>> text1.concordance(‘sea’, lines=all) - show all results
>>> text1.concordance(‘sea’, 10, lines=all) - change left and right context width to
10 characters and show all results
similar >>> text3.similar(‘silence’) - finds all words that share a common context
common_contexts >>>text1.common_contexts([‘sea’,’ocean’])

Counting
Count a string >>>len(‘this is a string of text’) – number of characters
Count a list of tokens >>>len(text1) –number of tokens
Make and count a list of >>>len(set(text1)) – notice that set return a list of unique tokens
unique tokens
Count occurrences >>> text1.count(‘heaven’) – how many times does a word occur?
Frequency >>>fd = nltk.FreqDist(text1) – creates a new data object that contains information
about word frequency
>>>fd[‘the’] – how many occurences of the word ‘the’
>>>fd.keys() – show the keys in the data object
>>>fd.values() – show the values in the data object
>>>fd.items() – show everything
>>>fd.keys()[0:50] – just show a portion of the info.
Frequency plots >>>fd.plot(50,cumulative=False) – generate a chart of the 50 most frequent words
Other FreqDist functions >>>fd.hapaxes()
>>>fd.freq(‘the’)
Get word lengths >>>lengths = [len(w) for w in text1]
And do FreqDist >>> fd = nltk.FreqDist(lengths)
FreqDist as a table >>>fd.tabulate()

Normalizing
De-punctuate >>>[w for w in text1 if w.isalpha() ] – not so much getting rid of punctuation, but
De-uppercaseify (?) keeping alphabetic characters
>>>[w.lower() for w in text] – make each word in the tokenized list lowercase
>>>[w.lower() for w in text if w.isalpha()] – all in one go
Sort >>>sorted(text1) – careful with this!
Unique words >>>set(text1) – set is oddly named, but very powerful. Leaves you with a list of
only one of each word.
Exclude stopwords Make your own list of word to be excluded:
>>>stopwords = [‘the’,’it’,’she’,’he’]
>>>mynewtext = [w for w in text1 if w not in stopwords]
Or you can also use predefined stopword lists from NLTK:
>>>from nltk.corpus import stopwords
>>>stopwords = stopwords.words(‘english’)
>>> mynewtext = [w for w in text1 if w not in stopwords]
148

Searching
Dispersion plot >>>text4.dispersion_plot([‘American’,’Liberty’,’Government’])
Find word that end with… >>>[w for w in text4 if w.endswith(‘ness’)]
Find words that start with… >>>[w for w in text4 if w.startswith(‘ness’)]
Find words that contain… >>>[w for w in text4 if ‘ee’ in w]
Combine them together: >>>[w for w in text4 if ‘ee’ in w and w.endswith(‘ing’)]
Regular expressions ‘Regular expressions’ is a syntax for describing sequences of characters usually
used to construct search queries. The Python ‘re’ module must first be imported:
>>>import re
>>>[w for w in text1 if re.search('^ab',w)] – ‘Regular expressions’ is too big of a
topic to cover here. Google it!

Chunking
Collocations are good for getting a quick glimpse of what a text is about
Collocations >>> text4.collocations() - multi-word expressions that commonly co-occur. Notice
that is not necessarily related to the frequency of the words.
>>>text4.collocations(num=100) – alter the number of phrases returned
Bigrams, Trigrams, and n-grams are useful for comparing texts, particularly for
plagiarism detection and collation
Bi-grams >>>nltk.bigrams(text4) – returns every string of two words
Tri-grams >>>nltk.trigrams(text4) – return every string of three words
n-grams >>>nltk.ngrams(text4, 5)

Tagging
part-of-speech tagging >>>mytext = nltk.word_tokenize(“This is my sentence”)
>>> nltk.pos_tag(mytext)

Working with your own texts:


Open a file for reading >>>file = open(‘myfile.txt’) – make sure you are in the correct directory before
starting Python
Read the file >>>t = file.read();
Tokenize the text >>>tokens = nltk.word_tokenize(t)
Convert to NLTK Text object >>>text = nltk.Text(tokens)
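Putting the four steps above together, here is a minimal end-to-end sketch (myfile.txt is simply
the placeholder file name used above):

```python
import nltk

file = open('myfile.txt')               # open a file for reading
t = file.read()                         # read the file
tokens = nltk.word_tokenize(t)          # tokenize the text
text = nltk.Text(tokens)                # convert to an NLTK Text object

fd = nltk.FreqDist(text)                # the earlier commands now apply, e.g. frequencies
print(fd.most_common(10))               # ten most frequent tokens
```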

Quitting Python
Quit >>>quit()

Part-of-Speech Codes

CC Coordinating conjunction NNS Noun, plural UH Interjection


CD Cardinal number NNP Proper noun, singular VB Verb, base form
DT Determiner NNPS Proper noun, plural VBD Verb, past tense
EX Existential there PDT Predeterminer VBG Verb, gerund or present
FW Foreign word POS Possessive ending participle
IN Preposition or subordinating PRP Personal pronoun VBN Verb, past participle
conjunction PRP$ Possessive pronoun VBP Verb, non-3rd person singular
JJ Adjective RB Adverb present
JJR Adjective, comparative RBR Adverb, comparative VBZ Verb, 3rd person singular
JJS Adjective, superlative RBS Adverb, superlative present
LS List item marker RP Particle WDT Wh-determiner
MD Modal SYM Symbol WP Wh-pronoun
NN Noun, singular or mass TO to WP$ Possessive wh-pronoun
WRB Wh-adverb
149

Resources
Commands for altering lists – useful in
Python for Humanists 1: Why Learn Python? creating stopword lists
http://www.rogerwhitson.net/?p=1260
list.append(x) - Add an item to the end of the list
list.insert(i, x) - Insert an item, i, at position, x.
‘Natural Language Processing with Python’ book online list.remove(x) - Remove item whose value is x.
http://www.nltk.org/book/ list.pop(x) - Remove item numer x from the list.
150

Natural Language Processing with Python & nltk Cheat Sheet


by RJ Murray (murenei) via cheatography.com/58736/cs/15485/

Handling Text

text='Some words'                          assign string
list(text)                                 Split text into character tokens
set(text)                                  Unique tokens
len(text)                                  Number of characters

Accessing corpora and lexical resources

from nltk.corpus import brown              import CorpusReader object
brown.words(text_id)                       Returns pretokenised document as list of words
brown.fileids()                            Lists docs in Brown corpus
brown.categories()                         Lists categories in Brown corpus

Tokenization

text.split(" ")                            Split by space
nltk.word_tokenize(text)                   nltk in-built word tokenizer
nltk.sent_tokenize(doc)                    nltk in-built sentence tokenizer

Lemmatization & Stemming

input="List listed lists listing listings" Different suffixes
words=input.lower().split(' ')             Normalize (lowercase) words
porter=nltk.PorterStemmer()                Initialise Stemmer
[porter.stem(t) for t in words]            Create list of stems
WNL=nltk.WordNetLemmatizer()               Initialise WordNet lemmatizer
[WNL.lemmatize(t) for t in words]          Use the lemmatizer

Part of Speech (POS) Tagging

nltk.help.upenn_tagset('MD')               Lookup definition for a POS tag
nltk.pos_tag(words)                        nltk in-built POS tagger
<use an alternative tagger to illustrate ambiguity>

Sentence Parsing

g=nltk.data.load('grammar.cfg')            Load a grammar from a file
g=nltk.CFG.fromstring("""...""")           Manually define grammar
parser=nltk.ChartParser(g)                 Create a parser out of the grammar
trees=parser.parse_all(text)
for tree in trees: print(tree)
from nltk.corpus import treebank
treebank.parsed_sents('wsj_0001.mrg')      Treebank parsed sentences

Text Classification

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vect=CountVectorizer().fit(X_train)        Fit bag of words model to data
vect.get_feature_names()                   Get features
vect.transform(X_train)                    Convert to doc-term matrix

Entity Recognition (Chunking/Chinking)

g="NP: {<DT>?<JJ>*<NN>}"                   Regex chunk grammar
cp=nltk.RegexpParser(g)                    Parse grammar
ch=cp.parse(pos_sent)                      Parse tagged sent. using grammar
print(ch)                                  Show chunks
ch.draw()                                  Show chunks in IOB tree
cp.evaluate(test_sents)                    Evaluate against test doc
sents=nltk.corpus.treebank.tagged_sents()
print(nltk.ne_chunk(sent))                 Print chunk tree

By RJ Murray (murenei), cheatography.com/murenei/ – Published 28th May, 2018, last updated 29th May, 2018. Page 1 of 2.
151


RegEx with Pandas & Named Groups

df=pd.DataFrame(time_sents, columns=['text'])

df['text'].str.split().str.len()

df['text'].str.contains('word')

df['text'].str.count(r'\d')

df['text'].str.findall(r'\d')

df['text'].str.replace(r'\w+day\b', '???')

df['text'].str.replace(r'(\w)', lambda x: x.groups()[0][:3])

df['text'].str.extract(r'(\d?\d):(\d\d)')

df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

df['text'].str.extractall(r'(?P<digits>\d)')

By RJ Murray (murenei), cheatography.com/murenei/ – Published 28th May, 2018, last updated 29th May, 2018. Page 2 of 2.
152

Introduction
According to industry estimates, only 21% of the available data is present in structured
form. Data is being generated as we speak: as we tweet, as we send messages on
WhatsApp, and in various other activities. The majority of this data exists in textual form,
which is highly unstructured in nature.

A few notable examples include tweets and posts on social media, user-to-user chat
conversations, news, blogs and articles, product or service reviews, and patient records in
the healthcare sector. More recent ones include chatbots and other voice-driven bots.

Although this data is high-dimensional, the information present in it is not directly accessible unless it is processed (read and understood) manually or analyzed by an automated system.

To produce significant and actionable insights from text data, it is important to get acquainted with the techniques and principles of Natural Language Processing (NLP).

So, if you plan to create chatbots this year, or you want to use the power of unstructured text, this guide is the right starting point. It covers the concepts of natural language processing, its techniques, and their implementation. The aim of the article is to teach the concepts of natural language processing and apply them to real data sets. We also have a video-based course on NLP with three real-life projects.
153

Table of Contents
1. Introduction to NLP
2. Text Preprocessing
o Noise Removal
o Lexicon Normalization
§ Lemmatization
§ Stemming
o Object Standardization
3. Text to Features (Feature Engineering on text data)
o Syntactical Parsing
§ Dependency Grammar
§ Part of Speech Tagging
o Entity Parsing
§ Phrase Detection
§ Named Entity Recognition
§ Topic Modelling
§ N-Grams
o Statistical features
§ TF – IDF
§ Frequency / Density Features
§ Readability Features
o Word Embeddings
4. Important tasks of NLP
o Text Classification
o Text Matching
§ Levenshtein Distance
§ Phonetic Matching
§ Flexible String Matching
o Coreference Resolution
o Other Problems
5. Important NLP libraries
154

1. Introduction to Natural Language Processing


NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data in a smart and efficient manner. By utilizing NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Before moving further, I would like to explain some terms that are used in the article:

• Tokenization – process of converting a text into tokens


• Tokens – words or entities present in the text
• Text object – a sentence or a phrase or a word or an article

Steps to install NLTK and its data:

Install Pip: run in terminal:

sudo easy_install pip

Install NLTK: run in terminal:

sudo pip install -U nltk

Download NLTK data: run python shell (in terminal) and write the following code:

```
import nltk
nltk.download()
```

Follow the instructions on screen and download the desired package or collection. Other
libraries can be directly installed using pip.
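
With NLTK installed, the three terms defined above can be illustrated with a minimal snippet (not from the original article); it assumes the punkt tokenizer data was included in the download.

```
from nltk import sent_tokenize, word_tokenize

text_object = "NLP is fun. It has many applications."   # a text object
print(sent_tokenize(text_object))                       # tokenization into sentence tokens
print(word_tokenize(text_object))                       # tokenization into word tokens
```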
155

2. Text Preprocessing
Since text is the most unstructured form of all the available data, various types of noise are present in it, and the data is not readily analyzable without pre-processing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text preprocessing.

It predominantly consists of three steps:

• Noise Removal
• Lexicon Normalization
• Object Standardization

The following image shows the architecture of text preprocessing pipeline.

2.1 Noise Removal


Any piece of text which is not relevant to the context of the data and the end output can be treated as noise.

For example: language stopwords (commonly used words of a language such as is, am, the, of, in), URLs or links, social media entities (mentions, hashtags), punctuation, and industry-specific words. This step deals with the removal of all types of noisy entities present in the text.

A general approach for noise removal is to prepare a dictionary of noisy entities and iterate over the text object token by token (or word by word), eliminating those tokens which are present in the noise dictionary.

The following Python code illustrates this approach.


156

```
# Sample code to remove noisy words from a text
noise_list = ["is", "a", "this", "..."]

def _remove_noise(input_text):
    words = input_text.split()
    noise_free_words = [word for word in words if word not in noise_list]
    noise_free_text = " ".join(noise_free_words)
    return noise_free_text

_remove_noise("this is a sample text")
>>> "sample text"
```
157

Another approach is to use regular expressions when dealing with special patterns of noise. We have explained regular expressions in detail in one of our previous articles. The following Python code removes a regex pattern from the input text:

```
# Sample code to remove a regex pattern
import re

def _remove_regex(input_text, regex_pattern):
    urls = re.finditer(regex_pattern, input_text)
    for i in urls:
        input_text = re.sub(i.group().strip(), '', input_text)
    return input_text

regex_pattern = "#[\w]*"

_remove_regex("remove this #hashtag from analytics vidhya", regex_pattern)
>>> "remove this from analytics vidhya"
```
158

2.2 Lexicon Normalization


Another type of textual noise arises from the multiple representations exhibited by a single word.

For example, "play", "player", "played", "plays" and "playing" are different variations of the word "play". Though they mean different things, contextually they are all similar. This step converts all such disparities of a word into their normalized form (also known as the lemma). Normalization is a pivotal step for feature engineering with text, as it converts high-dimensional features (N different features) into a low-dimensional space (1 feature), which is ideal for any ML model.

The most common lexicon normalization practices are:

• Stemming: Stemming is a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.
• Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word; it makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and grammar relations).

Below is sample code that performs lemmatization and stemming using Python's popular library NLTK.
159

```

from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer

stem = PorterStemmer()

word = "multiplying"

lem.lemmatize(word, "v")

>> "multiply"

stem.stem(word)

>> "multipli"

```
160

2.3 Object Standardization


Text data often contains words or phrases which are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup method to replace social media slang in a text.
161

```
lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', 'awsm': 'awesome', 'luv': 'love'}  # "..." add further mappings as needed

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

_lookup_words("RT this is a retweeted tweet by Shivam Bansal")
>>> "Retweet this is a retweeted tweet by Shivam Bansal"
```
162

Apart from the three steps discussed so far, other types of text preprocessing include handling encoding/decoding noise, grammar checking, and spelling correction. A detailed article about preprocessing and its methods is given in one of my previous articles.

3. Text to Features (Feature Engineering on text data)


To analyse preprocessed data, it needs to be converted into features. Depending on the usage, text features can be constructed using assorted techniques: syntactical parsing, entities / N-grams / word-based features, statistical features, and word embeddings. Read on to understand these techniques in detail.

3.1 Syntactic Parsing


Syntactical parsing involves the analysis of the words in a sentence for grammar and their arrangement in a manner that shows the relationships among the words. Dependency grammar and part of speech tags are the important attributes of text syntactics.

Dependency Trees – Sentences are composed of words sewn together. The relationship among the words in a sentence is determined by the basic dependency grammar. Dependency grammar is a class of syntactic text analysis that deals with (labeled) asymmetrical binary relations between two lexical items (words). Every relation can be represented in the form of a triplet (relation, governor, dependent). For example, consider the sentence "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas." The relationship among the words can be observed in the form of a tree representation as shown:
163

The tree shows that "submitted" is the root word of this sentence and is linked by two sub-trees (subject and object subtrees). Each subtree is itself a dependency tree, with relations such as ("Bills" <-> "ports" <by> "preposition" relation) and ("ports" <-> "immigration" <by> "conjugation" relation).

This type of tree, when parsed recursively in a top-down manner, gives grammar relation triplets as output, which can be used as features for many NLP problems like entity-wise sentiment analysis, actor and entity identification, and text classification. The Python wrapper StanfordCoreNLP (by the Stanford NLP Group, only commercial license) and NLTK dependency grammars can be used to generate dependency trees.
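
As an alternative to the tools mentioned above, the following hedged sketch (not from the original article) extracts (relation, governor, dependent) triplets with spaCy; it assumes the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm).

```
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bills on ports and immigration were submitted by Senator Brownback.")

# Each token carries its dependency relation (token.dep_) and its governor (token.head)
triplets = [(token.dep_, token.head.text, token.text) for token in doc]
print(triplets)
# prints tuples such as ('nsubjpass', 'submitted', 'Bills'), ('prep', 'Bills', 'on'), ...
```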

Part of speech tagging – Apart from the grammar relations, every word in a sentence is also associated with a part of speech (POS) tag (nouns, verbs, adjectives, adverbs, etc.). The POS tags define the usage and function of a word in the sentence. Here is a list of all possible POS tags defined by the University of Pennsylvania. The following code uses NLTK to perform POS tagging annotation on input text. (NLTK provides several implementations; the default one is the perceptron tagger.)
164

```
from nltk import word_tokenize, pos_tag

text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(pos_tag(tokens))

>>> [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'),
('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]
```

Part of Speech tagging is used for many important purposes in NLP:

A. Word sense disambiguation: Some words have multiple meanings according to their usage. For example, in the two sentences below:

I. "Please book my flight for Delhi"

II. "I am going to read this book in the flight"

"Book" is used in different contexts, and the part of speech tag differs between the two cases: in sentence I, "book" is used as a verb, while in sentence II it is used as a noun. (The Lesk algorithm is also used for similar purposes.)
165

B. Improving word-based features: A learning model can learn different contexts of a word when plain words are used as features, but if the part of speech tag is linked with them, the context is preserved, making for stronger features. For example:

Sentence -“book my flight, I will read this book”

Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)

Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1),
(“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
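
A minimal sketch of this idea using NLTK is shown below (not from the original article); the exact tags depend on the tagger, so the output is illustrative only.

```
from collections import Counter
from nltk import word_tokenize, pos_tag

sentence = "book my flight, I will read this book"
tagged = pos_tag(word_tokenize(sentence))

# Plain token counts merge every occurrence of a word into one feature ...
plain_counts = Counter(tok.lower() for tok, tag in tagged if tok.isalpha())
# ... while suffixing the POS tag keeps tokens with different tags as separate features
pos_counts = Counter(f"{tok.lower()}_{tag}" for tok, tag in tagged if tok.isalpha())

print(plain_counts)
print(pos_counts)
```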

C. Normalization and Lemmatization: POS tags are the basis of the lemmatization process for converting a word to its base form (lemma).

D. Efficient stopword removal: POS tags are also useful for efficient removal of stopwords. For example, some tags always mark the low-frequency / less important words of a language, such as (IN – "within", "upon", "except"), (CD – "one", "two", "hundred"), and (MD – "may", "must", etc.).

3.2 Entity Extraction (Entities as features)


Entities are defined as the most important chunks of a sentence – noun phrases, verb phrases
or both. Entity Detection algorithms are generally ensemble models of rule based parsing,
dictionary lookups, pos tagging and dependency parsing. The applicability of entity detection
can be seen in the automated chat bots, content analyzers and consumer insights.

Topic Modelling & Named Entity Recognition are the two key entity detection methods in
NLP.
166

A. Named Entity Recognition (NER)

The process of detecting named entities such as person names, location names, company names, etc. from text is called NER. For example:

Sentence – Sergey Brin, the manager of Google Inc. is walking in the streets of New York.

Named Entities – ( “person” : “Sergey Brin” ), (“org” : “Google Inc.”), (“location” : “New
York”)
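
Before looking at the building blocks of an NER model, here is a quick hedged sketch (not from the original article) that runs NLTK's off-the-shelf chunker on the example sentence; it assumes the relevant NLTK data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) have been downloaded, and the exact labels it produces may vary.

```
import nltk

sentence = ("Sergey Brin, the manager of Google Inc. is walking "
            "in the streets of New York.")
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# Collect (entity_type, entity_text) pairs from the chunk tree
entities = [(subtree.label(), " ".join(tok for tok, tag in subtree.leaves()))
            for subtree in tree if hasattr(subtree, "label")]
print(entities)
# prints a list like [('PERSON', ...), ('ORGANIZATION', ...), ('GPE', ...)]
```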

A typical NER model consists of three blocks:

Noun phrase identification: This step deals with extracting all the noun phrases from a
text using dependency parsing and part of speech tagging.

Phrase classification: This is the classification step, in which all the extracted noun phrases are classified into their respective categories (locations, names, etc.). Google Maps API provides a good way to disambiguate locations, and the open databases from DBpedia and Wikipedia can be used to identify person names or company names. Apart from this, one can curate lookup tables and dictionaries by combining information from different sources.

Entity disambiguation: Sometimes entities are misclassified, hence creating a validation layer on top of the results is useful. Knowledge graphs can be exploited for this purpose. Popular knowledge graphs include the Google Knowledge Graph, IBM Watson, and Wikipedia.

B. Topic Modeling

Topic modeling is a process of automatically identifying the topics present in a text corpus; it derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as "a repeating pattern of co-occurring terms in a corpus". A good topic model results in "health", "doctor", "patient", "hospital" for a topic such as Healthcare, and "farm", "crops", "wheat" for a topic such as Farming.

Latent Dirichlet Allocation (LDA) is the most popular topic modelling technique. Following is the code to implement topic modeling using LDA in Python. For a detailed explanation of its working and implementation, check the complete article here.
167

```
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."

doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using the dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

# Results
print(ldamodel.print_topics())
```
169

C. N-Grams as Features

A combination of N words together is called an N-gram. N-grams (N > 1) are generally more informative than single words (unigrams) as features, and bigrams (N = 2) are considered the most important features of all. The following code generates the bigrams of a text.

```
def generate_ngrams(text, n):
    words = text.split()
    output = []
    for i in range(len(words) - n + 1):
        output.append(words[i:i+n])
    return output

>>> generate_ngrams('this is a sample text', 2)
# [['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]
```
170

3.3 Statistical Features


Text data can also be quantified directly into numbers using several techniques described in
this section:

A. Term Frequency – Inverse Document Frequency (TF – IDF)

TF-IDF is a weighted model commonly used for information retrieval problems. It aims to convert text documents into vector models on the basis of the occurrence of words in the documents, without considering the exact ordering. For example, say there is a dataset of N text documents. For any document "D", TF and IDF are defined as follows:

Term Frequency (TF) – TF for a term "t" is defined as the count of the term "t" in a document "D".

Inverse Document Frequency (IDF) – IDF for a term is defined as the logarithm of the ratio of the total number of documents available in the corpus to the number of documents containing the term "t".

TF . IDF – The TF-IDF formula gives the relative importance of a term in a corpus (list of documents): tf-idf(t, D) = TF(t, D) * IDF(t) = TF(t, D) * log(N / number of documents containing t). A small worked sketch of this formula is given below, followed by code using Python's scikit-learn package to convert a text into TF-IDF vectors:
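
The sketch below (not from the original article) applies these definitions to a toy corpus; it assumes the natural logarithm for IDF, and the variable names are illustrative.

```
# Computing TF-IDF by hand for a toy corpus, using IDF = log(N / df(t)).
import math

docs = [["this", "is", "sample", "text"],
        ["another", "sample", "text"],
        ["completely", "different", "words"]]
N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                       # count of the term in the document
    df = sum(1 for d in docs if term in d)     # number of documents containing the term
    return tf * math.log(N / df)

print(tf_idf("sample", docs[0]))   # term appears in 2 of 3 docs -> lower weight
print(tf_idf("this", docs[0]))     # term appears in 1 of 3 docs -> higher weight
```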
171

```
from sklearn.feature_extraction.text import TfidfVectorizer

obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print(X)

>>>
(0, 1)  0.345205016865
(0, 4)  0.444514311537
...
(2, 1)  0.345205016865
(2, 4)  0.444514311537
```

The model creates a vocabulary dictionary and assigns an index to each word. Each row in the output contains a tuple (i, j) and the tf-idf value of the word at index j in document i.
172

B. Count / Density / Readability Features

Count or density based features can also be used in models and analysis. These features might seem trivial but can have a great impact on learning models. Some of these features are: word count, sentence count, punctuation counts, and industry-specific word counts. Other types of measures include readability measures such as syllable counts, the SMOG index, and Flesch reading ease. Refer to the Textstat library to create such features.
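
As a quick hedged sketch (not part of the original article), the snippet below computes a few such features for a single text; it assumes the textstat package is installed (pip install textstat) and uses function names from textstat's documented API, which may differ slightly between versions.

```
import string
import textstat

text = "Analytics Vidhya is a great source to learn data science. It is easy to read!"

features = {
    "word_count": len(text.split()),                                         # simple count feature
    "punctuation_count": sum(1 for ch in text if ch in string.punctuation),  # another count feature
    "sentence_count": textstat.sentence_count(text),
    "syllable_count": textstat.syllable_count(text),
    "flesch_reading_ease": textstat.flesch_reading_ease(text),
    "smog_index": textstat.smog_index(text),
}
print(features)
```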

3.4 Word Embedding (text vectors)


Word embedding is the modern way of representing words as vectors. The aim of word
embedding is to redefine the high dimensional word features into low dimensional feature
vectors by preserving the contextual similarity in the corpus. They are widely used in deep
learning models such as Convolutional Neural Networks and Recurrent Neural Networks.

Word2Vec and GloVe are the two popular models for creating word embeddings of a text. These models take a text corpus as input and produce word vectors as output.

The Word2Vec model is composed of a preprocessing module, a shallow neural network model called Continuous Bag of Words (CBOW), and another shallow neural network model called skip-gram. These models are widely used for many other NLP problems. Word2Vec first constructs a vocabulary from the training corpus and then learns word embedding representations. The following code uses the gensim package to prepare word embeddings as vectors.
173

```
from gensim.models import Word2Vec

sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],
             ['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus
model = Word2Vec(sentences, min_count=1)

print(model.wv.similarity('data', 'science'))
>>> 0.11222489293

print(model.wv['learning'])
>>> array([ 0.00459356  0.00303564 -0.00467622  0.00209638, ...])
```

Word vectors can be used as feature vectors for ML models, to measure text similarity using cosine similarity techniques, and for word clustering and text classification.
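
As a minimal follow-up sketch (assuming the model trained in the snippet above is still in scope), cosine similarity between two word vectors can be computed directly with NumPy:

```
import numpy as np

v1, v2 = model.wv['data'], model.wv['science']
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine)   # should match model.wv.similarity('data', 'science')
```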
174

4. Important tasks of NLP


This section talks about different use cases and problems in the field of natural language
processing.

4.1 Text Classification


Text classification is one of the classical problems of NLP. Well-known examples include email spam identification, topic classification of news, sentiment classification, and the organization of web pages by search engines.

Text classification, in common words, is defined as a technique to systematically classify a text object (document or sentence) into one of a fixed set of categories. It is really helpful when the amount of data is too large, especially for organizing, information filtering, and storage purposes.

A typical natural language classifier consists of two parts: (a) training and (b) prediction, as shown in the image below. Firstly, the text input is processed and features are created. The machine learning model then learns these features and is used for predicting on new text.
175

Here is code for a Naive Bayes classifier using the TextBlob library (built on top of NLTK).

```

from textblob.classifiers import NaiveBayesClassifier as NBC

from textblob import TextBlob

training_corpus = [

('I am exhausted of this work.', 'Class_B'),

("I can't cooperate with this", 'Class_B'),

('He is my badest enemy!', 'Class_B'),

('My management is poor.', 'Class_B'),

('I love this burger.', 'Class_A'),

('This is an brilliant place!', 'Class_A'),

('I feel very good about these dates.', 'Class_A'),

('This is my best work.', 'Class_A'),

("What an awesome view", 'Class_A'),

('I do not like this dish', 'Class_B')]


176

test_corpus = [

("I am not feeling well today.", 'Class_B'),

("I feel brilliant!", 'Class_A'),

('Gary is a friend of mine.', 'Class_A'),

("I can't believe I'm doing this.", 'Class_B'),

('The date was good.', 'Class_A'), ('I do not enjoy my job',

'Class_B')]

model = NBC(training_corpus)

print(model.classify("Their codes are amazing."))

>>> "Class_A"

print(model.classify("I don't like their computer."))

>>> "Class_B"

print(model.accuracy(test_corpus))

>>> 0.83

```
177

Scikit-learn also provides a pipeline framework for text classification:

```

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn import svm

# preparing data for SVM model (using the same training_corpus, test_corpus from naive bayes example)
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = []
test_labels = []
for row in test_corpus:
    test_data.append(row[0])
    test_labels.append(row[1])

# Create feature vectors
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)

# Train the feature vectors
train_vectors = vectorizer.fit_transform(train_data)

# Apply model on test data
test_vectors = vectorizer.transform(test_data)

# Perform classification with SVM, kernel=linear
model = svm.SVC(kernel='linear')
model.fit(train_vectors, train_labels)

prediction = model.predict(test_vectors)
>>> ['Class_A' 'Class_A' 'Class_B' 'Class_B' 'Class_A' 'Class_A']

print(classification_report(test_labels, prediction))
```

Text classification models are heavily dependent upon the quality and quantity of features; when applying any machine learning model, it is always good practice to include more and more training data. Here are some tips that I wrote about improving text classification accuracy in one of my previous articles.
180

4.2 Text Matching / Similarity


One of the important areas of NLP is the matching of text objects to find similarities. Important applications of text matching include automatic spelling correction, data de-duplication, and genome analysis.

A number of text matching techniques are available depending upon the requirement. This section describes the important techniques in detail.

A. Levenshtein Distance – The Levenshtein distance between two strings is defined as


the minimum number of edits needed to transform one string into the other, with the
allowable edit operations being insertion, deletion, or substitution of a single character.
Following is a memory-efficient implementation.
181

```
def levenshtein(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for index2, char2 in enumerate(s2):
        newDistances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1],
                                             distances[index1 + 1],
                                             newDistances[-1])))
        distances = newDistances
    return distances[-1]

print(levenshtein("analyze", "analyse"))
```
182

B. Phonetic Matching – A phonetic matching algorithm takes a keyword as input (a person's name, a location name, etc.) and produces a character string that identifies a set of words that are (roughly) phonetically similar. It is very useful for searching large text corpora, correcting spelling errors, and matching relevant names. Soundex and Metaphone are the two main phonetic algorithms used for this purpose. Python's Fuzzy module can be used to compute Soundex strings for different words, for example:

```
import fuzzy

soundex = fuzzy.Soundex(4)

print(soundex('ankit'))
>>> "A523"

print(soundex('aunkit'))
>>> "A523"
```

C. Flexible String Matching – A complete text matching system includes different algorithms pipelined together to handle a variety of text variations. Regular expressions are really helpful for this purpose as well. Other common techniques include exact string matching, lemmatized matching, and compact matching (which takes care of spaces, punctuation, slang, etc.).
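
As a hedged sketch of the "compact matching" idea described above (not from the original article; function names are illustrative), case, spaces and punctuation can be normalised away before an exact comparison:

```
import re

def compact(text):
    # keep only lowercase letters and digits
    return re.sub(r'[^a-z0-9]', '', text.lower())

def compact_match(a, b):
    return compact(a) == compact(b)

print(compact_match("Analytics-Vidhya!", "analytics vidhya"))   # True
```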
183

D. Cosine Similarity – When text is represented as vector notation, a general cosine similarity can be applied to measure vectorized similarity: cosine(V1, V2) = (V1 . V2) / (||V1|| * ||V2||). The following code converts texts to vectors (using term frequency) and applies cosine similarity to provide the closeness between two texts.

```
import math
from collections import Counter

def get_cosine(vec1, vec2):
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = text.split()
    return Counter(words)

text1 = 'This is an article on analytics vidhya'
text2 = 'article on analytics vidhya is about natural language processing'

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)

>>> 0.62
```
185

4.3 Coreference Resolution


Coreference resolution is the process of finding relational links among words (or phrases) within a sentence. Consider the example sentence: "Donald went to John's office to see the new table. He looked at it for an hour."

Humans can quickly figure out that "he" denotes Donald (and not John), and that "it" denotes the table (and not John's office). Coreference resolution is the component of NLP that does this job automatically. It is used in document summarization, question answering, and information extraction. Stanford CoreNLP provides a Python wrapper for commercial purposes.

4.4 Other NLP problems / tasks


• Text Summarization – Given a text article or paragraph, summarize it automatically to produce the most important and relevant sentences in order.
• Machine Translation – Automatically translate text from one human language to another, taking care of grammar, semantics, information about the real world, etc.
• Natural Language Generation and Understanding – Converting information from computer databases or semantic intents into readable human language is called language generation. Converting chunks of text into more logical structures that are easier for computer programs to manipulate is called language understanding.
• Optical Character Recognition – Given an image representing printed text, determine the corresponding text.
• Document to Information – This involves parsing the textual data present in documents (websites, files, PDFs, and images) into an analyzable and clean format.

5. Important Libraries for NLP (python)


• Scikit-learn: Machine learning in Python.
• Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
• Pattern – A web mining module with tools for NLP and machine learning.
• TextBlob – Easy-to-use NLP tools API, built on top of NLTK and Pattern.
• spaCy – Industrial-strength NLP with Python and Cython.
• Gensim – Topic Modelling for Humans.
• Stanford CoreNLP – NLP services and packages by the Stanford NLP Group.
