Professional Documents
Culture Documents
• Business Science 11 – 14
o Problem Framework 11
o DS Python Workflow 12
o DS R Workflow 13
• Jupyter Notebook 15
• Python for Data Science 16 – 63
o Dataquest 16 – 20
§ Python Basics 16
§ Python Intermediate 17
§ NumPy 18
§ Pandas 19
§ Regular Expressions 20
o Datacamp 21 – 34
§ Importing Data 21
§ Python Basics 22
§ Numpy Basics 23
§ Pandas Basics 24
§ Pandas 25
§ Matplotlib 26
§ Seaborn 27
§ Bokeh 28
§ SciPy – Linear Algebra 29
§ PySpark – SQL Basics 30
§ PySpark – RDD Basics 31
§ Scikit – Learn 32
§ Keras 33
§ spaCY 34 – 35
o Python III 35 – 36
o Python Crash Course 38 – 63
§ Intro 38 – 39
§ Lists 40 – 41
§ Dictionaries 42 – 43
§ If Statements & While Loops 44 – 45
§ Functions 46 – 47
§ Classes 48 – 49
§ Files & Exceptions 50 – 51
§ Testing Code 52 – 53
§ Pygame 54 – 55
§ Matplotlib 56- 57
§ Pygal 58 -59
§ Django 60- 63
• Command Line 63 – 74
o Linux Bash 63 – 70
o DVC 71
o GIT 72 – 74
• SQL 75 – 80
• Math 81 – 106
95
o Probabilities & Statistics 81 – 83
o Probability 84 – 93
§ Table of Distributions 93
o Linear Algebra & Calculus 94 – 95
o Calculus 96 – 106
• Machine Learning 107 – 121
o Choosing a Model 107
o Supervised Learning 108 – 113
§ Regression 108 – 109, 110 – 112
§ Classification 110 – 113
o Unsupervised Learning 114 – 118
§ Clustering 114 – 115
§ Unsupervised Learning 116 – 118
o Tips & Tricks 119 – 121
• Deep Learning 122 – 139
o Keras 122
o Neural Network Cells 123
o Neural Network Graphs 124
o Deep Learning 125 – 126
o Recurrent Neural Networks 127 – 131
o Convolutional Neural Networks 132 – 136
o Tips & Tricks 137 – 139
• Big Data 140 – 143
o Dask 140 – 141
o PySpark 142 – 143
• Natural Language Processing 144 – 185
o NLP 144 – 146
o Text Analysis with NLTK 147 – 149
o NLP Cheat Sheet 150 – 151
o Guide to NLP 152 - 185
Data Science Cheatsheet
Compiled by Maverick Lin (http://mavericklin.com)
Last Updated August 13, 2018
2. Data-Driven: Given some data, what interesting Event E: set of outcomes of an experiment e.g. event less than arithmetic mean = n a1 a2 ...a3
p
problems can be solved with it? that a roll is 5, or the event that sum of 2 rolls is 7 Median Exact middle value among a dataset. Useful for
Probability of an Outcome s or P(s): number that skewed distribution or data with outliers.
The heart of data science is to always ask questions. Al- satisfies 2 properties Mode Most frequent element in a dataset.
ways be curious about the world. for each outcome s, 0 P(s) 1
1. What can we learn from this data? 2. p(s) = 1 Variability
2. What actions can we take once we find whatever it Standard Deviation Measures the squares di↵erences
1. P
is we are looking for? probabilities of the between the individual elements and the mean
2
outcomes of the experiment: p(E) = s⇢E p(s) i=1 (xi x)
= N 1
Types of Data Random Variable V: numerical function on the out-
q PN
Variance V = 2
Probability of Event E: sum of theP
Independence, Conditional, Compound of the same event due to random noise/error. Variance
in tables. e.g. blobs of text, images, audio
Independent Events: A and B are independent i↵: can be explained away by attributing to sampling or
Quantitative Data: Numerical. e.g. height, weight
Categorical Data: Data that can be labeled or divided
P(A \ B) = P(A)P(B) measurement errors. Other times, the variance is due to
P(A|B) = P(A) the random fluctuations of the universe.
into groups. e.g. race, sex, hair color.
P(B|A) = P(B)
Big Data: Massive datasets, or data that contains
; Conditional Probability: P(A|B) = P(A,B)/P(B) Correlation Analysis
greater variety arriving in increasing volumes and with
Bayes Theorem: P(A|B) = P(B|A)P(A)/P(B) Correlation coefficients r(X,Y) is a statistic that measures
ever-higher velocity (3 Vs). Cannot fit in the memory of
Joint Probability: P(A,B) = P(B|A)P(A) the degree that Y is a function of X and vice versa.
a single machine.
Marginal Probability: P(A) Correlation values range from -1 to 1, where 1 means
Data Sources/Fomats fully correlated, -1 means negatively-correlated, and 0
Probability Distributions means no correlation.
Most Common Data Formats CSV, XML, SQL,
Probability Density Function (PDF) Gives the prob- Pearson Coefficient Measures the degree of the rela-
JSON, Protocol Bu↵ers
ability that a rv takes on the value x: pX (x) = P (X = x) tionship between linearly related variables
Data Sources Companies/Proprietary Data, APIs, Gov-
Cumulative Density Function (CDF) Gives the prob- )
ernment, Academic, Web Scraping/Crawling r = Cov(X,Y
(X) (Y )
ability that a random variable is less than or equal to x: Spearman Rank Coefficient Computed on ranks and
Main Types of Problems
FX (x) = P (X x) depicts monotonic relationships
Note: The PDF and the CDF of a given random variable
Two problems arise repeatedly in data science. contain exactly the same information. Note: Correlation does not imply causation!
Classification: Assigning something to a discrete set of
possibilities. e.g. spam or non-spam, Democrat or Repub-
lican, blood type (A, B, AB, O)
Regression: Predicting a numerical value. e.g. some-
one’s income, next year GDP, stock price
1
Data Cleaning Feature Engineering Statistical Analysis
Data Cleaning is the process of turning raw data into Feature engineering is the process of using domain knowl- Process of statistical reasoning: there is an underlying
a clean and analyzable data set. ”Garbage in, garbage edge to create features or input variables that help ma- population of possible things we can potentially observe
out.” Make sure garbage doesn’t get put in. chine learning algorithms perform better. Done correctly, and only a small subset of them are actually sampled (ide-
it can help increase the predictive power of your models. ally at random). Probability theory describes what prop-
Errors vs. Artifacts Feature engineering is more of an art than science. FE is erties our sample should have given the properties of the
1. Errors: information that is lost during acquisi- one of the most important steps in creating a good model. population, but statistical inference allows us to deduce
tion and can never be recovered e.g. power outage, As Andrew Ng puts it: what the full population is like after analyzing the sample.
crashed servers
“Coming up with features is difficult, time-consuming,
2. Artifacts: systematic problems that arise from
requires expert knowledge. ‘Applied machine learning’ is
the data cleaning process. these problems can be
basically feature engineering.”
corrected but we must first discover them
Continuous Data
Data Compatibility Raw Measures: data that hasn’t been transformed yet
Data compatibility problems arise when merging datasets. Rounding: sometimes precision is noise; round to
Make sure you are comparing ”apples to apples” and nearest integer, decimal etc..
not ”apples to oranges”. Main types of conver- Scaling: log, z-score, minmax scale
sions/unifications: Imputation: fill in missing values using mean, median, Sampling From Distributions
• units (metric vs. imperial) model output, etc.. Inverse Transform Sampling Sampling points from
• numbers (decimals vs. integers), Binning: transforming numeric features into categorical a given probability distribution is sometimes necessary
• names (John Smith vs. Smith, John), ones (or binned) e.g. values between 1-10 belong to A, to run simulations or whether your data fits a particular
• time/dates (UNIX vs. UTC vs. GMT), between 10-20 belong to B, etc. distribution. The general technique is called inverse
• currency (currency type, inflation-adjusted, divi- Interactions: interactions between features: e.g. sub- transform sampling or Smirnov transform. First draw
dends) traction, addition, multiplication, statistical test a random number p between [0,1]. Compute value x
Statistical: log/power transform (helps turn skewed such that the CDF equals p: FX (x) = p. Use x as the
Data Imputation distributions more normal), Box-Cox value to be the random value drawn from the distribution
Process of dealing with missing values. The proper meth- Row Statistics: number of NaN’s, 0’s, negative values, described by FX (x).
ods depend on the type of data we are working with. Gen- max, min, etc
eral methods include: Dimensionality Reduction: using PCA, clustering, Monte Carlo Sampling In higher dimensions, correctly
• Drop all records containing missing data factor analysis etc sampling from a given distribution becomes more tricky.
• Heuristic-Based: make a reasonable guess based on Generally want to use Monte Carlo methods, which
knowledge of the underlying domain Discrete Data typically follow these rules: define a domain of possible
• Mean Value: fill in missing data with the mean Encoding: since some ML algorithms cannot work on inputs, generate random inputs from a probability
• Random Value categorical data, we need to turn categorical data into nu- distribution over the domain, perform a deterministic
• Nearest Neighbor: fill in missing data using similar merical data or vectors calculation, and analyze the results.
data points Ordinal Values: convert each distinct feature into a ran-
• Interpolation: use a method like linear regression to dom number (e.g. [r,g,b] becomes [1,2,3])
predict the value of the missing data One-Hot Encoding: each of the m features becomes a
vector of length m with containing only one 1 (e.g. [r, g,
Outlier Detection b] becomes [[1,0,0],[0,1,0],[0,0,1]])
Outliers can interfere with analysis and often arise from Feature Hashing Scheme: turns arbitrary features into
mistakes during data collection. It makes sense to run a indices in a vector or matrix
”sanity check”. Embeddings: if using words, convert words to vectors
(word embeddings)
Miscellaneous
Lowercasing, removing non-alphanumeric, repairing,
unidecode, removing unknown characters
data.
Classic Statistical Distributions Modeling- Overview Modeling- Philosophies
Binomial Distribution (Discrete) Modeling is the process of incorporating information into Modeling is the process of incorporating information
Assume X is distributed Bin(n,p). X is the number of a tool which can forecast and make predictions. Usually, into a tool which can forecast and make predictions.
”successes” that we will achieve in n independent trials, we are dealing with statistical modeling where we want Designing and validating models is important, as well as
where each trial is either a success or failure and each to analyze relationships between variables. Formally, we evaluating the performance of models. Note that the best
success occurs with the same probability p and each want to estimate a function f (X) such that: forecasting model may not be the most accurate one.
failure occurs with probability q=1-p.
Y = f (X) + ✏
PDF: P (X = x) = nk px (1 p)n x Philosophies of Modeling
EV: µ = np Variance = npq where X = (X1 , X2 , ...Xp ) represents the input variables, Occam’s Razor Philosophical principle that the simplest
Y represents the output variable, and ✏ represents random explanation is the best explanation. In modeling, if we
Normal/Gaussian Distribution (Continuous) error. are given two models that predicts equally well, we should
Assume X in distributed N (µ, 2 ). It is a bell-shaped choose the simpler one. Choosing the more complex one
and symmetric distribution. Bulk of the values lie close Statistical learning is set of approaches for estimating can often result in overfitting.
to the mean and no value is too extreme. Generalization this f (X). Bias Variance Trade-O↵ Inherent part of predictive
of the binomial distribution2 as 2n ! 1. modeling, where models with lower bias will have higher
PDF: P (x) = p12⇡ e (x µ) /2 Why Estimate f(X)? variance and vice versa. Goal is to achieve low bias and
EV: µ Variance: 2 Prediction: once we have a good estimate fˆ(X), we can low variance.
Implications: 68%-95%-99% rule. 68% of probability use it to make predictions on new data. We treat fˆ as a • Bias: error from incorrect assumptions to make tar-
mass fall within 1 of the mean, 95% within 2 , and black box, since we only care about the accuracy of the get function easier to learn (high bias ! missing rel-
99.7% within 3 . predictions, not why or how it works. evant relations or underfitting)
Inference: we want to understand the relationship • Variance: error from sensitivity to fluctuations in
Poisson Distribution (Discrete) between X and Y. We can no longer treat fˆ as a black the dataset, or how much the target estimate would
Assume X is distributed Pois( ). Poisson expresses box since we want to understand how Y changes with di↵er if di↵erent training data was used (high vari-
the probability of a given number of events occurring respect to X = (X1 , X2 , ...Xp ) ance ! modeling noise or overfitting)
in a fixed interval of time/space if these events occur
independently and with a known constant rate . More About ✏
x
PDF: P (x) = e x! EV: Variance = The error term ✏ is composed of the reducible and irre-
ducible error, which will prevent us from ever obtaining a
Power Law Distributions (Discrete) perfect fˆ estimate.
Many data distributions have much longer tails than • Reducible: error that can potentially be reduced
the normal or Poisson distributions. In other words, by using the most appropriate statistical learning
the change in one quantity varies as a power of another technique to estimate f . The goal is to minimize
quantity. It helps measure the inequality in the world. the reducible error.
e.g. wealth, word frequency and Pareto Principle (80/20 • Irreducible: error that cannot be reduced no
No Free Lunch Theorem No single machine learning
Rule) matter how well we estimate f . Irreducible error is
algorithm is better than all the others on all problems.
PDF: P(X=x) = cx ↵ , where ↵ is the law’s exponent unknown and unmeasurable and will always be an
It is common to try multiple models and find one that
and c is the normalizing constant upper bound for ✏.
works best for a particular problem.
Note: There will always be trade-o↵s between model
Thinking Like Nate Silver
flexibility (prediction) and model interpretability (infer-
1. Think Probabilistically Probabilistic forecasts are
ence). This is just another case of the bias-variance trade-
more meaningful than concrete statements and should be
o↵. Typically, as flexibility increases, interpretability de-
reported as probability distributions (including along
creases. Much of statistical learning/modeling is finding a
with mean prediction µ.
way to balance the two.
2. Incorporate New Information Use live models,
which continually updates using new information. To up-
date, use Bayesian reasoning to calculate how probabilities
change in response to new evidence.
3. Look For Consensus Forecast Use multiple distinct
sources of evidence. Ssome models operate this way, such
as boosting and bagging, which uses large number of weak
classifiers to produce a strong one.
3
Modeling- Taxonomy Modeling- Evaluation Metrics Modeling- Evaluation Environment
There are many di↵erent types of models. It is important Need to determine how good our model is. Best way to Evaluation metrics provides use with the tools to estimate
to understand the trade-o↵s and when to use a certain assess models is out-of-sample predictions (data points errors, but what should be the process to obtain the
type of model. your model has never seen). best estimate? Resampling involves repeatedly drawing
samples from a training set and refitting a model to each
Parametric vs. Nonparametric Classification sample, which provides us with additional information
• Parametric: models that first make an assumption Predicted Yes Predicted No
compared to fitting the model once, such as obtaining a
about a function form, or shape, of f (linear). Then Actual Yes True Positives (TP) False Negatives (FN) better estimate for the test error.
fits the model. This reduces estimating f to just Actual No False Positives (FP) True Negatives (TN)
estimating set of parameters, but if our assumption Key Concepts
Accuracy: ratio of correct predictions over total pre-
was wrong, will lead to bad results. Training Data: data used to fit your models or the set
dictions. Misleading when class sizes are substantially
• Non-Parametric: models that don’t make any as- P +T N used for learning
di↵erent. accuracy = T P +TT N +F N +F P
sumptions about f , which allows them to fit a wider Validation Data: data used to tune the parameters of
Precision: how often the classifier is correct when it
range of shapes; but may lead to overfitting P a model
predicts positive: precision = T PT+F P
Supervised vs. Unsupervised Test Data: data used to evaluate how good your model
Recall: how often the classifier is correct for all positive
• Supervised: models that fit input variables xi = P is. Ideally your model should never touch this data until
instances: recall = T PT+F N
(x1 , x2 , ...xn ) to a known output variables yi = final testing/evaluation
F-Score: single measurement to describe performance:
(y1 , y2 , ...yn ) precision·recall
Cross Validation
F = 2 · precision + recall
ROC Curves: plots true positive rates and false pos-
• Unsupervised: models that take in input variables
xi = (x1 , x2 , ...xn ), but they do not have an asso- Class of methods that estimate test error by holding out
itive rates for various thresholds, or where the model
ciated output to supervise the training. The goal a subset of training data from the fitting process.
determines if a data point is positive or negative (e.g. if
is understand relationships between the variables or Validation Set: split data into training set and valida-
>0.8, classify as positive). Best possible area under the
observations. tion set. Train model on training and estimate test error
ROC curve (AUC) is 1, while random is 0.5, or the main
Blackbox vs. Descriptive using validation. e.g. 80-20 split
diagonal line.
• Blackbox: models that make decisions, but we do Leave-One-Out CV (LOOCV): split data into
not know what happens ”under the hood” e.g. deep training set and validation set, but the validation set
Regression
learning, neural networks consists of 1 observation. Then repeat n-1 times until all
Errors are defined as the di↵erence between a prediction
observations have been used as validation. Test erro is
y0 and the actual result y.
• Descriptive: models that provide insight into why
they make their decisions e.g. linear regression, de- the average of these n test error estimates.
Absolute Error: = y0 y
cision trees k-Fold CV: randomly divide data into k groups (folds) of
Squared Error: 2 = (y0 y)2
First-Principle vs. Data-Driven approximately equal size. First fold is used as validation
Mean-Squared Error: M SE = n1 n i=1 (y0 yi ) 2
• First-Principle: models based on a prior belief of pi and the rest as training. Then repeat k times and find
how the system under investigation works, incorpo- Root Mean-Squared Error: RMSD = M SE average of the k estimates.
P
rates domain knowledge (ad-hoc) Absolute Error Distribution: Plot absolute error dis-
• Data-Driven: models based on observed correla- tribution: should be symmetric, centered around 0, bell- Bootstrapping
tions between input and output variables shaped, and contain rare extreme outliers. Methods that rely on random sampling with replacement.
Deterministic vs. Stochastic Bootstrapping helps with quantifying uncertainty associ-
• Deterministic: models that produce a single ”pre- ated with a given estimate or model.
diction” e.g. yes or no, true or false
• Stochastic: models that produce probability distri- Amplifying Small Data Sets
butions over possible events What can we do it we don’t have enough data?
Flat vs. Hierarchical • Create Negative Examples- e.g. classifying pres-
• Flat: models that solve problems on a single level, idential candidates, most people would be unquali-
no notion of subproblems fied so label most as unqualified
• Hierarchical: models that solve several di↵erent • Synthetic Data- create additional data by adding
nested subproblems noise to the real data
4
Linear Regression Linear Regression II Logistic Regression
Linear regression is a simple and useful tool for predicting Improving Linear Regression Logistic regression is used for classification, where the
a quantitative response. The relationship between input Subset/Feature Selection: approach involves identify- response variable is categorical rather than numerical.
variables X = (X1 , X2 , ...Xp ) and output variable Y takes ing a subset of the p predictors that we believe to be best
the form: related to the response. Then we fit model using the re- The model works by predicting the probability that Y be-
duced set of variables. longs to a particular category by first fitting the data to a
Y ⇡ 0 + 1 X1 + ... + p Xp +✏ • Best, Forward, and Backward Subset Selection linear regression model, which is then passed to the logis-
Shrinkage/Regularization: all variables are used, but tic function (below). The logistic function will always pro-
0 ... p are the unknown coefficients (parameters) which
estimated coefficients are shrunken towards zero relative duce a S-shaped curve, so regardless of X, we can always
we are trying to determine. The best coefficients
to the least squares estimate. represents the tuning obtain a sensible answer (between 0 and 1). If the prob-
will lead us to the best ”fit”, which can be found by
parameter- as ability is above a certain predetermined threshold (e.g.
minimizing the residual sum squares (RSS), or the
increases, flexibility decreases ! de-
creased variance but increased bias. The tuning parameter P(Yes) > 0.5), then the model will predict Yes.
is key in determining the sweet spot between under and e 0 + 1 X1 +...+ p Xp
the predicted ith value. RSS = n i=1 ei , where ei = yi yˆi p(X) =
over-fitting. In addition, while Ridge will always produce 1+e 0 + 1 X1 +...+ p Xp
a model with p variables, Lasso can force coefficients to
sum of the di↵erences betweenPthe actual ith value and
matrix is computationally expensive. M-dimensional subspace, where M < p. This is achieved its observed in a certain class and close to zero otherwise.
Gradient Descent: First-order optimization algorithm. by computing M di↵erent linear combinations of the This is done by maximizing the likelihood function:
We can find the minimum of a convex function by variables. Can use PCA.
starting at an arbitrary point and repeatedly take steps Miscellaneous: Removing outliers, feature scaling, l( 0 , 1 ) = p(xi ) (1 p(xi ))
removing multicollinearity (correlated variables) i:yi =1 i0 :yi0 =1
in the downward direction, which can be found by taking
Y Y
learning rate ↵ determines the size of the steps we take R2 : Measure of fit that represents the proportion of clude forcing balanced data by removing observations from
in the downward direction. variance explained, or the variability in Y that can be the larger class, replicate data from the smaller class, or
explained using X. It takes on a value between 0 and 1. heavily weigh the training examples toward instances of
Gradient descent algorithm in two dimensions. Repeat RSS
Generally the higher the better. R2 = 1 T SS
, where the larger class.
until convergence. 2
Total Sum of Squares (TSS) = (yi ȳ) Multi-Class Classification: the more classes you try to
@
1. w0t+1 := w0t ↵ @w 0
J(w0 , w1 ) predict, the harder it will be for the the classifier to be ef-
t+1 t @
P
2. w1 := w1 ↵ @w J(w0 , w1 ) Evaluating Coefficient Estimates fective. It is possible with logistic regression, but another
1
Standard Error (SE) of the coefficients can be used to per- approach, such as Linear Discriminant Analysis (LDA),
For non-convex functions, gradient descent no longer guar- form hypothesis tests on the coefficients: may prove better.
antees an optimal solutions since there may be local min- H0 : No relationship between X and Y, Ha : Some rela-
imas. Instead, we should run the algorithm from di↵erent tionship exists. A p-value can be obtained and can be
starting points and use the best local minima we find for interpreted as follows: a small p-value indicates that a re-
the solution. lationship between the predictor (X) and the response (Y)
Stochastic Gradient Descent: instead of taking a step exists. Typical p-value cuto↵s are around 5 or 1 %.
after sampling the entire training set, we take a small
batch of training data at random to determine our next
step. Computationally more efficient and may lead to
faster convergence.
5
Distance/Network Methods Nearest Neighbor Classification Clustering
Interpreting examples as points in space provides a way Distance functions allow us to identify the points closest Clustering is the problem of grouping points by sim-
to find natural groupings or clusters among data e.g. to a given target, or the nearest neighbors (NN) to a ilarity using distance metrics, which ideally reflect the
which stars are the closest to our sun? Networks can also given point. The advantages of NN include simplicity, similarities you are looking for. Often items come from
be built from point sets (vertices) by connecting related interpretability and non-linearity. logical ”sources” and clustering is a good way to reveal
points. those origins. Perhaps the first thing to do with any
k-Nearest Neighbors data set. Possible applications include: hypothesis
Measuring Distances/Similarity Measure Given a positive integer k and a point x0 , the KNN development, modeling over smaller subsets of data, data
There are several ways of measuring distances between classifier first identifies k points in the training data reduction, outlier detection.
points a and b in d dimensions- with closer distances most similar to x0 , then estimates the conditional
implying similarity. probability of x0 being in class j as the fraction of K-Means Clustering
the k points whose values belong to j. The opti- Simple and elegant algorithm to partition a dataset into
d mal value for k can be found using cross validation. K distinct, non-overlapping clusters.
Minkowski Distance Metric: dk (a, b) = k i=1 |ai bi |k
The parameter k provides a way to tradeo↵ between the 1. Choose a K. Randomly assign a number between 1
qP
largest and the total dimensional di↵erence. In other and K to each observation. These serve as initial
words, larger values of k place more emphasis on large cluster assignments
di↵erences between feature values than smaller values. Se- 2. Iterate until cluster assignments stop changing
lecting the right k can significantly impact the the mean- (a) For each of the K clusters, compute the cluster
ingfulness of your distance function. The most popular centroid. The kth cluster centroid is the vector
values are 1 and 2. of the p feature means for the observations in
the kth cluster.
KNN Algorithm
• Manhattan (k=1): city block distance, or the sum
of the absolute di↵erence between two points (b) Assign each observation to the cluster whose
1. Compute distance D(a,b) from point b to all points centroid is closest (where closest is defined us-
2. Select k closest points and their labels
• Euclidean (k=2): straight line distance
ing distance metric).
3. Output class with most frequent labels in k points Since the results of the algorithm depends on the initial
Optimizing KNN random assignments, it is a good idea to repeat the
Comparing a query point a in d dimensions against n train- algorithm from di↵erent random initializations to obtain
ing examples computes with a runtime of O(nd), which the best overall results. Can use MSE to determine which
can cause lag as points reach millions or billions. Popular cluster assignment is better.
choices to speed up KNN include:
d Hierarchical Clustering
Weighted Minkowski: dk (a, b) = k
• Vernoi Diagrams: partitioning plane into regions
i=1 wi |ai based on distance to points in a specific subset of
bi |k , in
Alternative clustering algorithm that does not require us
some scenarios, not all dimensions are equal. Can convey the plane
qP
ity distributions by measuring the uncertainty gained or a and b are close to each other, and h(a)!= h(b) if of clusters that are least dissimilar ( most simi-
uncertainty lost when replacing distribution A with dis- they are far apart. lar). Fuse these two clusters. The dissimilarity
tribution B. However, this is not a metric but forms the
between these two clusters indicates height in
basis for the Jensen-Shannon Divergence Metric.
dendrogram where fusion should be placed.
(b) Assign each observation to the cluster whose
Jensen-Shannon: JS(A, B) = 12 KL(A||M )+ 12 KL(M ||B),
where M is the average of A and B. The JS function is the
centroid is closest (where closest is defined us-
right metric for calculating distances between probability
ing distance metric).
distributions
Linkage: Complete (max dissimilarity), Single (min), Av-
erage, Centroid (between centroids of cluster A and B)
6
Machine Learning Part I Machine Learning Part II Machine Learning Part III
Comparing ML Algorithms Decision Trees Support Vector Machines
Power and Expressibility: ML methods di↵er in terms Binary branching structure used to classify an arbitrary Work by constructing a hyperplane that separates
of complexity. Linear regression fits linear functions while input vector X. Each node in the tree contains a sim- points between two classes. The hyperplane is de-
NN define piecewise-linear separation boundaries. More ple feature comparison against some field (xi > 42?). termined using the maximal margin hyperplane, which
complex models can provide more accurate models, but Result of each comparison is either true or false, which is the hyperplane that is the maximum distance from
at the risk of overfitting. determines if we should proceed along to the left or the training observations. This distance is called
Interpretability: some models are more transparent right child of the given node. Also known as some- the margin. Points that fall on one side of the
and understandable than others (white box vs. black box times called classification and regression trees (CART). hyperplane are classified as -1 and the other +1.
models)
Ease of Use: some models feature few parame-
ters/decisions (linear regression/NN), while others
require more decision making to optimize (SVMs)
Training Speed: models di↵er in how fast they fit the
necessary parameters
Prediction Speed: models di↵er in how fast they make
predictions given a query
Advantages: Non-linearity, support for categorical
variables, easy to interpret, application to regression. Principal Component Analysis (PCA)
Disadvantages: Prone to overfitting, instable (not Principal components allow us to summarize a set of
robust to noise), high variance, low bias correlated variables with a smaller set of variables that
collectively explain most of the variability in the original
Note: rarely do models just use one decision tree. set. Essentially, we are ”dropping” the least important
Instead, we aggregate many decision trees using methods feature variables.
like ensembling, bagging, and boosting.
Naive Bayes Principal Component Analysis is the process by
Naive Bayes methods are a set of supervised learning Ensembles, Bagging, Random Forests, Boosting which principal components are calculated and the use
algorithms based on applying Bayes’ theorem with the Ensemble learning is the strategy of combining many of them to analyzing and understanding the data. PCA
”naive” assumption of independence between every pair di↵erent classifiers/models into one predictive model. It is an unsupervised approach and is used for dimensional-
of features. revolves around the idea of voting: a so-called ”wisdom of ity reduction, feature extraction, and data visualization.
crowds” approach. The most predicted class will be the Variables after performing PCA are independent. Scal-
Problem: Suppose we need to classify vector X = x1 ...xn final prediction. ing variables is also important while performing PCA.
into m classes, C1 ...Cm . We need to compute the proba- Bagging: ensemble method that works by taking B boot-
bility of each possible class given X, so we can assign X strapped subsamples of the training data and constructing
the label of the class with highest probability. We can B trees, each tree training on a distinct subsample as
calculate a probability using the Bayes’ Theorem: Random Forests: builds on bagging by decorrelating
P (X|Ci )P (Ci ) the trees. We do everything the same like in bagging, but
P (Ci |X) =
P (X) when we build the trees, everytime we consider a split, a
random sample of the p predictors is chosen as split can-
Where: p
didates, not the full set (typically m ⇡ p). When m =
1. P (Ci ): the prior probability of belonging to class i p, then we are just doing bagging.
2. P (X): normalizing constant, or probability of seeing Boosting: the main idea is to improve our model where
the given input vector over all possible input vectors it is not performing well by using information from previ-
3. P (X|Ci ): the conditional probability of seeing ously constructed classifiers. Slow learner. Has 3 tuning
input vector X given we know the class is Ci parameters: number of classifiers B, learning parameter ,
interaction depth d (controls interaction order of model).
The prediction model will formally look like:
P (X|Ci )P (Ci )
C(X) = argmaxi2classes(t) P (X)
probability distributions
derstanding high level features
Deep Learning Part III Big Data- Hadoop Overview Big Data- Hadoop Ecosystem
Deep Learning Terminology and Concepts Data can no longer fit in memory on one machine An entire ecosystem of tools have emerged around
(monolithic), so a new way of computing was devised Hadoop, which are based on interacting with HDFS.
Neuron: node in a NN, typically taking in multiple in- using a group of computers to process this ”big data” Below are some popular ones:
put values and generating one output value, calculates the (distributed). Such a group is called a cluster, which
output value by applying an activation function (nonlin- makes up server farms. All of these servers have to be Hive: data warehouse software built o top of Hadoop that
ear transformation) to a weighted sum of input values coordinated in the following ways: partition data, coor- facilitates reading, writing, and managing large datasets
Weights: edges in a NN, the goal of training is to deter- dinate computing tasks, handle fault tolerance/recovery, residing in distributed storage using SQL-like queries
mine the optimal weight for each feature; if weight = 0, and allocate capacity to process. (HiveQL). Hive abstracts away underlying MapReduce
corresponding feature does not contribute jobs and returns HDFS in the form of tables (not HDFS).
Neural Network: composed of neurons (simple building Hadoop Pig: high level scripting language (Pig Latin) that
blocks that actually “learn”), contains activation functions Hadoop is an open source distributed processing frame- enables writing complex data transformations. It pulls
that makes it possible to predict non-linear outputs work that manages data processing and storage for big unstructured/incomplete data from sources, cleans it, and
Activation Functions: mathematical functions that in- data applications running in clustered systems. It is com- places it in a database/data warehouses. Pig performs
troduce non-linearity to a network e.g. RELU, tanh prised of 3 main components: ETL into data warehouse while Hive queries from data
Sigmoid Function: function that maps very negative • Hadoop Distributed File System (HDFS): warehouse to perform analysis (GCP: DataFlow).
numbers to a number very close to 0, huge numbers close a distributed file system that provides high- Spark: framework for writing fast, distributed programs
to 1, and 0 to .5. Useful for predicting probabilities throughput access to application data by partition- for data processing and analysis. Spark solves similar
Gradient Descent/Backpropagation: fundamental ing data across many machines problems as Hadoop MapReduce but with a fast in-
loss optimizer algorithms, of which the other optimizers • YARN: framework for job scheduling and cluster memory approach. It is an unified engine that supports
are usually based. Backpropagation is similar to gradient resource management (task coordination) SQL queries, streaming data, machine learning and
descent but for neural nets • MapReduce: YARN-based system for parallel graph processing. Can operate separately from Hadoop
Optimizer: operation that changes the weights and bi- processing of large data sets on multiple machines but integrates well with Hadoop. Data is processed
ases to reduce loss e.g. Adagrad or Adam using Resilient Distributed Datasets (RDDs), which are
Weights / Biases: weights are values that the input fea- HDFS immutable, lazily evaluated, and tracks lineage. ;
tures are multiplied by to predict an output value. Biases Each disk on a di↵erent machine in a cluster is comprised Hbase: non-relational, NoSQL, column-oriented
;
are the value of the output given a weight of 0. of 1 master node and the rest are workers/data nodes. database management system that runs on top of
Converge: algorithm that converges will eventually reach The master node manages the overall file system by HDFS. Well suited for sparse data sets (GCP: BigTable)
an optimal answer, even if very slowly. An algorithm that storing the directory structure and the metadata of the Flink/Kafka: stream processing framework. Batch
doesn’t converge may never reach an optimal answer. files. The data nodes physically store the data. Large streaming is for bounded, finite datasets, with periodic
Learning Rate: rate at which optimizers change weights files are broken up and distributed across multiple ma- updates, and delayed processing. Stream processing
and biases. High learning rate generally trains faster but chines, which are also replicated across multiple machines is for unbounded datasets, with continuous updates,
risks not converging, whereas a lower rate trains slower to provide fault tolerance. and immediate processing. Stream data and stream
Numerical Instability: issues with very large/small val- processing must be decoupled via a message queue.
ues due to limits of floating point numbers in computers MapReduce Can group streaming data (windows) using tumbling
Embeddings: mapping from discrete objects, such as Parallel programming paradigm which allows for process- (non-overlapping time), sliding (overlapping time), or
words, to vectors of real numbers. useful because classi- ing of huge amounts of data by running processes on mul- session (session gap) windows.
fiers/neural networks work well on vectors of real numbers tiple machines. Defining a MapReduce job requires two Beam: programming model to define and execute data
Convolutional Layer: series of convolutional opera- stages: map and reduce. processing pipelines, including ETL, batch and stream
tions, each acting on a di↵erent slice of the input matrix • Map: operation to be performed in parallel on small (continuous) processing. After building the pipeline,
Dropout: method for regularization in training NNs, portions of the dataset. the output is a key-value it is executed by one of Beam’s distributed processing
works by removing a random selection of some units in pair < K, V > back-ends (Apache Apex, Apache Flink, Apache Spark,
a network layer for a single gradient step • Reduce: operation to combine the results of Map and Google Cloud Dataflow). Modeled as a Directed
Early Stopping: method for regularization that involves Acyclic Graph (DAG).
ending model training early YARN- Yet Another Resource Negotiator Oozie: workflow scheduler system to manage Hadoop
Gradient Descent: technique to minimize loss by com- Coordinates tasks running on the cluster and assigns new jobs
puting the gradients of loss with respect to the model’s nodes in case of failure. Comprised of 2 subcomponents: Sqoop: transferring framework to transfer large amounts
parameters, conditioned on training data the resource manager and the node manager. The re- of data into HDFS from relational databases (MySQL)
Pooling: Reducing a matrix (or matrices) created by an source manager runs on a single master node and sched-
earlier convolutional layer to a smaller matrix. Pooling ules tasks across nodes. The node manager runs on all
usually involves taking either the maximum or average other nodes and manages tasks on the individual node.
value across the pooled area
9
SQL Part I Python- Data Structures Recommended Resources
Structured Query Language (SQL) is a declarative Data structures are a way of storing and manipulating • Data Science Design Manual
language used to access & manipulate data in databases. data and each data structure has its own strengths and (www.springer.com/us/book/9783319554433)
Usually the database is a Relational Database Man- weaknesses. Combined with algorithms, data structures • Introduction to Statistical Learning
agement System (RDBMS), which stores data arranged allow us to efficiently solve problems. It is important to (www-bcf.usc.edu/~gareth/ISL/)
in relational database tables. A table is arranged in know the main types of data structures that you will need • Probability Cheatsheet
columns and rows, where columns represent character- to efficiently solve problems. (/www.wzchen.com/probability-cheatsheet/)
istics of stored data and rows represent actual data entries. • Google’s Machine Learning Crash Course
Lists: or arrays, ordered sequences of objects, mutable (developers.google.com/machine-learning/
Basic Queries crash-course/)
>>> l = [42, 3.14, "hello","world"]
- filter columns: SELECT col1, col3... FROM table1
- filter the rows: WHERE col4 = 1 AND col5 = 2 Tuples: like lists, but immutable
- aggregate the data: GROUP BY. . .
- limit aggregated data: HAVING count(*) > 1 >>> t = (42, 3.14, "hello","world")
- order of the results: ORDER BY col2 Dictionaries: hash tables, key-value pairs, unsorted
Learn to apply the BSPF in Data Science For Business With R (DS4B 201-R) Version 1.0
Data Science with
Python Workflow
If you want to learn Python and this workflow for business
analysis, take the Python For Business Analysis (DS4B
101-P) course through Business Science University.
Click the links for
Documentation
CS = Cheat Sheet
matplotlib
seaborn
Pandas
text
Visualize
time series
Pandas categorical
(CS) missing
Transform
Import Tidy Communicate
JupyterLab
Data Web Pandas Jupyter Notebook
I/O tools Beautiful Soup data structures Model
Django
Scrappy group by Flask
joins
reshape (pivot)
ScikitLearn
Featuretools
Jupyter & Pycharm
Important Resources
Anaconda Distribution: https://www.anaconda.com/download/
Python Documentation: https://docs.python.org/
Python Standard Library: https://docs.python.org/3/library
Business Science University
"Data Science Education for the Enterprise" university.businessscience.io
12
version: 0.0
Data Science with R
Workflow
The Data Science With R Workflow is available in the book: R
For Data Science. If you want to learn R and this
workflow for business analysis, take the R For Business
Analysis (DS4B 101-R) course through Business Science
University. Click the links for
Documentation
ggplot2 (CS)
dplyr (CS) Visualize
stringr (CS)
lubridate (CS)
forcats
Base R (CS)
Transform purrr (CS)
Import Tidy (iteration) Communicate
readr (CS)
readxl / writexl tibble (CS) RMarkdown (CS)
Model
odbc / DBI tidyr (CS) Shiny (CS)
rvest
recipes broom
rsample yardstick
RStudio IDE (CS) fs (file system) parsnip dials
CS = Cheat Sheet
Important Resources
R For Data Science Book: http://r4ds.had.co.nz/
Rmarkdown Book: https://bookdown.org/yihui/rmarkdown/
Data Visualization Book: https://rkabacoff.github.io/datavis/
More Cheatsheets: https://www.rstudio.com/resources/cheatsheets/
tidyverse packages: https://www.tidyverse.org/
Connecting to databases: https://db.rstudio.com/
RMarkdown website: https://rmarkdown.rstudio.com/
Shiny web applications website: http://shiny.rstudio.com/
Jenny Bryan's purrr tutorial: https://jennybryan.org/
Business Science University
"Data Science Education for the Enterprise" university.businessscience.io
13
version: 1.1
Data Science with Text Analysis & NLP Machine Learning
Special Topics Multi-Threaded/Scalable/Production ML:
Text Mining with R (Book): tidytext
NLP: H2O (CS)
H2O word2vec: Word embeddings Extreme Gradient Boosting: xgboost
text2vec: fast vectorization, topic modeling R + Spark: sparklyr (CS)
udpipe: UDPipe C++ lib in R Sparkling Water (Spark + H2O): rsparkling
Time Series Analysis ML (Tidy): parsnip
ML: caret (CS)
Time-aware tibbles: tibbletime & tsibble
Convert between classes: timetk & tsbox Network Analysis
Time Series Index Summary: timetk Deep Learning
Generating Future Series: timetk Network Data Transformations (Tidy): tidygraph
Network Data Transformations: igraph R Interface to TensorFlow Homepage:
Keras (CS)
Forecasting TF Estimators
Network Viz TensorFlow (Core)
ARIMA, ETS, etc: forecast & fable
Static:
Tidy, glance, augment for forecast models: sweep
ggraph - Graph plotting utilities for ggplot2
Converting forecast prediction to tibble: sweep
Interactive (JavaScript):
networkD3 - D3 Networks in R
plotly - plotly.js (network graphs) in R Speed & Scale
Anomaly Detection
Fastest Single-Node Speed: data.table (CS)
Identify anomalies: anomalize Distributed Cluster (Spark): sparklyr (CS)
Geospatial Analysis
Geocoding (getting lat/long, bboxes, & sf's):
Interoperability
ggmap - Google API (requires key)
Python: reticulate Java: rJava
Financial Analysis osmdata - OpenStreet Overpass API
C++: Rcpp
tmaptools - OpenStreet Nominatum API
Simple Features (sf objects): sf (CS) (tidy)
Getting financial data: tidyquant & quantmod Spatial Objects (sp objects): sp (non-tidy)
Quantitative Analysis: tidyquant & xts/TTR
Portfolio Analysis: tidyquant & Miscellaneous Tools
PerformanceAnalytics
Geospatial Viz Interactive Plotting: htmlwidgets for R
Building R Packages: R packages Book
Financial & Time Viz Static: Pkg Development Tools: devtools (CS)
ggmap - Google API (requires key) R Templates: usethis
Static: osmplotr - Impressive Maps via OSM Build Web Doc's: pkgdown
tidyquant - Financial ggplot2 geoms tmap - Thematic Maps Advanced Concepts (Advanced R Book)
Interactive: cartography (CS) - Thematic Maps rlang & Tidy Evaluation (CS)
highcharter - highchart.js in R Interactive (JavaScript): Making Blogs & Books:
dygraphs - xts plotting leaflet (CS) - leaflet.js in R Make a Website/Blog: blogdown
plotly - plotly.js (financial) in R plotly - plotly.js (maps) in R Write a Web Book: bookdown
Posting Code (GitHub, Stack Overflow): reprex
Business Science University
"Data Science Education for the Enterprise" university.businessscience.io
14
Working with Different Programming Languages Widgets
Python For Data Science Cheat Sheet Kernels provide computation and communication with front-end interfaces Notebook widgets provide the ability to visualize and control changes
Jupyter Notebook like the notebooks. There are three main kernels: in your data, often as a control like a slider, textbox, etc.
Learn More Python for Data Science Interactively at www.DataCamp.com
You can use them to build interactive GUIs for your notebooks or to
IRkernel IJulia
synchronize stateful and stateless information between Python and
Installing Jupyter Notebook will automatically install the IPython kernel. JavaScript.
Saving/Loading Notebooks Restart kernel Interrupt kernel
Create new notebook Restart kernel & run Interrupt kernel & Download serialized Save notebook
all cells clear all output state of all widget with interactive
Open an existing
Connect back to a models in use widgets
Make a copy of the notebook Restart kernel & run remote notebook
current notebook all cells Embed current
Rename notebook Run other installed
widgets
kernels
Revert notebook to a
Save current notebook
previous checkpoint Command Mode:
and record checkpoint
Download notebook as
Preview of the printed - IPython notebook 15
notebook - Python
- HTML
Close notebook & stop - Markdown 13 14
- reST
running any scripts - LaTeX 1 2 3 4 5 6 7 8 9 10 11 12
- PDF
- Tags DataCamp
Learn Python for Data Science Interactively
LEARN DATA SCIENCE ONLINE
16
Start Learning For Free - www.dataquest.io
L I STS len(my_set) - Returns the number of objects in now - wks4 - Return a datetime object
l.pop(3) - Returns the fourth item from l and my_set (or, the number of unique values from l) representing the time 4 weeks prior to now
deletes it from the list a in my_set - Returns True if the value a exists in newyear_2020 = dt.datetime(year=2020,
l.remove(x) - Removes the first item in l that is my_set month=12, day=31) - Assign a datetime
equal to x object representing December 25, 2020 to
l.reverse() - Reverses the order of the items in l REGULAR EXPRESSIONS newyear_2020
l[1::2] - Returns every second item from l, import re - Import the Regular Expressions module newyear_2020.strftime("%A, %b %d, %Y")
commencing from the 1st item re.search("abc",s) - Returns a match object if - Returns "Thursday, Dec 31, 2020"
l[-5:] - Returns the last 5 items from l specific axis the regex "abc" is found in s, otherwise None dt.datetime.strptime('Dec 31, 2020',"%b
re.sub("abc","xyz",s) - Returns a string where %d, %Y") - Return a datetime object
ST R I N G S all instances matching regex "abc" are replaced representing December 31, 2020
s.lower() - Returns a lowercase version of s by "xyz"
s.title() - Returns s with the first letter of every RANDOM
word capitalized L I ST C O M P R E H E N S I O N import random - Import the random module
"23".zfill(4) - Returns "0023" by left-filling the A one-line expression of a for loop random.random() - Returns a random float
string with 0’s to make it’s length 4. [i ** 2 for i in range(10)] - Returns a list of between 0.0 and 1.0
s.splitlines() - Returns a list by splitting the the squares of values from 0 to 9 random.randint(0,10) - Returns a random
string on any newline characters. [s.lower() for s in l_strings] - Returns the integer between 0 and 10
Python strings share some common methods with lists list l_strings, with each item having had the random.choice(l) - Returns a random item from
s[:5] - Returns the first 5 characters of s .lower() method applied the list l
"fri" + "end" - Returns "friend" [i for i in l_floats if i < 0.5] - Returns
"end" in s - Returns True if the substring "end" the items from l_floats that are less than 0.5 COUNTER
is found in s from collections import Counter - Import the
F U N C T I O N S F O R LO O P I N G Counter class
RANGE for i, value in enumerate(l): c = Counter(l) - Assign a Counter (dict-like)
Range objects are useful for creating sequences of print("The value of item {} is {}". object with the counts of each unique item from
integers for looping. format(i,value)) l, to c
range(5) - Returns a sequence from 0 to 4 - Iterate over the list l, printing the index location c.most_common(3) - Return the 3 most common
range(2000,2018) - Returns a sequence from 2000 of each item and its value items from l
to 2017 for one, two in zip(l_one,l_two):
range(0,11,2) - Returns a sequence from 0 to 10, print("one: {}, two: {}".format(one,two)) T RY/ E XC E P T
with each item incrementing by 2 - Iterate over two lists, l_one and l_two and print Catch and deal with Errors
range(0,-10,-1) - Returns a sequence from 0 to -9 each value l_ints = [1, 2, 3, "", 5] - Assign a list of
list(range(5)) - Returns a list from 0 to 4 while x < 10: integers with one missing value to l_ints
x += 1 l_floats = []
DICTIONARIES - Run the code in the body of the loop until the for i in l_ints:
max(d, key=d.get) - Return the key that value of x is no longer less than 10 try:
corresponds to the largest value in d l_floats.append(float(i))
min(d, key=d.get) - Return the key that DAT E T I M E except:
corresponds to the smallest value in d import datetime as dt - Import the datetime l_floats.append(i)
module - Convert each value of l_ints to a float, catching
S E TS now = dt.datetime.now() - Assign datetime and handling ValueError: could not convert
my_set = set(l) - Return a set object containing object representing the current time to now string to float: where values are missing.
the unique values from l wks4 = dt.datetime.timedelta(weeks=4)
- Assign a timedelta object representing a
timespan of 4 weeks to wks4
KEY IMPORTS
We’ll use shorthand in this cheat sheet Import these to start
arr - A numpy Array object import numpy as np
KEY IMPORTS
We’ll use shorthand in this cheat sheet Import these to start
df - A pandas DataFrame object import pandas as pd
s - A pandas Series object import numpy as np
S P E C I A L C H A R AC T E R S \A | Matches the expression to its right at the (?:A) | Matches the expression as represented
^ | Matches the expression to its right at the absolute start of a string whether in single by A, but unlike (?PAB), it cannot be
start of a string. It matches every such or multi-line mode. retrieved afterwards.
instance before each \n in the string. \Z | Matches the expression to its left at the (?#...) | A comment. Contents are for us to
$ | Matches the expression to its left at the absolute end of a string whether in single read, not for matching.
end of a string. It matches every such or multi-line mode. A(?=B) | Lookahead assertion. This matches
instance before each \n in the string. the expression A only if it is followed by B.
. | Matches any character except line A(?!B) | Negative lookahead assertion. This
terminators like \n. S E TS matches the expression A only if it is not
\ | Escapes special characters or denotes [ ] | Contains a set of characters to match. followed by B.
character classes. [amk] | Matches either a, m, or k. It does not (?<=B)A | Positive lookbehind assertion.
A|B | Matches expression A or B. If A is match amk. This matches the expression A only if B
matched first, B is left untried. [a-z] | Matches any alphabet from a to z. is immediately to its left. This can only
+ | Greedily matches the expression to its left 1 [a\-z] | Matches a, -, or z. It matches - matched fixed length expressions.
or more times. because \ escapes it. (?<!B)A | Negative lookbehind assertion.
* | Greedily matches the expression to its left [a-] | Matches a or -, because - is not being This matches the expression A only if B is
0 or more times. used to indicate a series of characters. not immediately to its left. This can only
? | Greedily matches the expression to its left [-a] | As above, matches a or -. matched fixed length expressions.
0 or 1 times. But if ? is added to qualifiers [a-z0-9] | Matches characters from a to z (?P=name) | Matches the expression matched
(+, *, and ? itself) it will perform matches in and also from 0 to 9. by an earlier group named “name”.
a non-greedy manner. [(+*)] | Special characters become literal (...)\1 | The number 1 corresponds to
{m} | Matches the expression to its left m inside a set, so this matches (, +, *, and ). the first group to be matched. If we want
times, and not less. [^ab5] | Adding ^ excludes any character in to match more instances of the same
{m,n} | Matches the expression to its left m to the set. Here, it matches characters that are expression, simply use its number instead of
n times, and not less. not a, b, or 5. writing out the whole expression again. We
{m,n}? | Matches the expression to its left m can use from 1 up to 99 such groups and
times, and ignores n. See ? above. their corresponding numbers.
GROUPS
( ) | Matches the expression inside the
C H A R AC T E R C L AS S E S parentheses and groups it. POPULAR PYTHON RE MODULE
( A. K.A. S P E C I A L S E Q U E N C E S) (?) | Inside parentheses like this, ? acts as an FUNCTIONS
\w | Matches alphanumeric characters, which extension notation. Its meaning depends on re.findall(A, B) | Matches all instances
means a-z, A-Z, and 0-9. It also matches the character immediately to its right. of an expression A in a string B and returns
the underscore, _. (?PAB) | Matches the expression AB, and it them in a list.
\d | Matches digits, which means 0-9. can be accessed with the group name. re.search(A, B) | Matches the first instance
\D | Matches any non-digits. (?aiLmsux) | Here, a, i, L, m, s, u, and x are of an expression A in a string B, and returns
\s | Matches whitespace characters, which flags: it as a re match object.
include the \t, \n, \r, and space characters. a — Matches ASCII only re.split(A, B) | Split a string B into a list
\S | Matches non-whitespace characters. i — Ignore case using the delimiter A.
\b | Matches the boundary (or empty string) L — Locale dependent re.sub(A, B, C) | Replace A with B in the
at the start and end of a word, that is, m — Multi-line string C.
between \w and \W. s — Matches all
\B | Matches where \b does not, that is, the u — Matches unicode
boundary of \w characters. x — Verbose
na_values=[""]) String to recognize as NA/NaN >>> data_array = data.values Convert a DataFrame to an a NumPy array DataCamp
Learn R for Data Science Interactively
Lists Also see NumPy Arrays Libraries
Python For Data Science Cheat Sheet >>> a = 'is' Import libraries
Python Basics >>> b = 'nice' >>> import numpy Data analysis Machine learning
Learn More Python for Data Science Interactively at www.datacamp.com >>> my_list = ['my', 'list', a, b] >>> import numpy as np
>>> my_list2 = [[4,5,6,7], [3,4,5,6]] Selective import
>>> from math import pi Scientific computing 2D plo"ing
Variables and Data Types Selecting List Elements Index starts at 0
Subset Install Python
Variable Assignment
>>> my_list[1] Select item at index 1
>>> x=5 Select 3rd last item
>>> my_list[-3]
>>> x
Slice
5 >>> my_list[1:3] Select items at index 1 and 2
Calculations With Variables >>> my_list[1:] Select items a!er index 0
>>> my_list[:3] Select items before index 3 Leading open data science platform Free IDE that is included Create and share
>>> x+2 Sum of two variables powered by Python with Anaconda documents with live code,
>>> my_list[:] Copy my_list
7 visualizations, text, ...
>>> x-2 Subtraction of two variables
Subset Lists of Lists
>>> my_list2[1][0] my_list[list][itemOfList]
3 Numpy Arrays Also see Lists
>>> my_list2[1][:2]
>>> x*2 Multiplication of two variables
>>> my_list = [1, 2, 3, 4]
10 List Operations
>>> x**2 Exponentiation of a variable >>> my_array = np.array(my_list)
25 >>> my_list + my_list >>> my_2darray = np.array([[1,2,3],[4,5,6]])
>>> x%2 Remainder of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']
Selecting Numpy Array Elements Index starts at 0
1 >>> my_list * 2
>>> x/float(2) Division of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice'] Subset
2.5 >>> my_list2 > 4 >>> my_array[1] Select item at index 1
True 2
Types and Type Conversion Slice
List Methods >>> my_array[0:2] Select items at index 0 and 1
str() '5', '3.45', 'True' Variables to strings
my_list.index(a) Get the index of an item array([1, 2])
>>>
int() 5, 3, 1 Variables to integers >>> my_list.count(a) Count an item Subset 2D Numpy arrays
>>> my_list.append('!') Append an item at a time >>> my_2darray[:,0] my_2darray[rows, columns]
my_list.remove('!') Remove an item array([1, 4])
float() 5.0, 1.0 Variables to floats >>>
>>> del(my_list[0:1]) Remove an item Numpy Array Operations
bool() True, True, True Variables to booleans >>> my_list.reverse() Reverse the list
>>> my_array > 3
>>> my_list.extend('!') Append an item array([False, False, False, True], dtype=bool)
>>> my_list.pop(-1) Remove an item >>> my_array * 2
Asking For Help >>> my_list.insert(0,'!') Insert an item array([2, 4, 6, 8])
>>> help(str) >>> my_list.sort() Sort the list >>> my_array + np.array([5, 6, 7, 8])
array([6, 8, 10, 12])
Strings
>>> my_string = 'thisStringIsAwesome' Numpy Array Functions
String Operations Index starts at 0
>>> my_string >>> my_array.shape Get the dimensions of the array
'thisStringIsAwesome' >>> my_string[3] >>> np.append(other_array) Append items to an array
>>> my_string[4:9] >>> np.insert(my_array, 1, 5) Insert items in an array
String Operations >>> np.delete(my_array,[1]) Delete items in an array
String Methods >>> np.mean(my_array) Mean of the array
>>> my_string * 2
'thisStringIsAwesomethisStringIsAwesome' >>> my_string.upper() String to uppercase >>> np.median(my_array) Median of the array
>>> my_string + 'Innit' >>> my_string.lower() String to lowercase >>> my_array.corrcoef() Correlation coefficient
'thisStringIsAwesomeInnit' >>> my_string.count('w') Count String elements >>> np.std(my_array) Standard deviation
>>> 'm' in my_string >>> my_string.replace('e', 'i') Replace String elements
22
>>> np.unicode_ Fixed-length unicode type >>> c.sort(axis=0) Sort the elements of an array's axis DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Asking For Help Dropping
>>> help(pd.Series.loc)
>>> s.drop(['a', 'c']) Drop values from rows (axis=0)
Pandas Basics Selection Also see NumPy Arrays >>> df.drop('Country', axis=1) Drop values from columns(axis=1)
Learn Python for Data Science Interactively at www.DataCamp.com
Ge!ing
>>> s['b'] Get one element Sort & Rank
-5
Pandas >>> df.sort_index() Sort by labels along an axis
>>> df[1:] Get subset of a DataFrame >>> df.sort_values(by='Country') Sort by the values along an axis
The Pandas library is built on NumPy and provides easy-to-use Country Capital Population >>> df.rank() Assign ranks to entries
data structures and data analysis tools for the Python 1 India New Delhi 1303171035
2 Brazil Brasília 207847528
programming language. Retrieving Series/DataFrame Information
Selecting, Boolean Indexing & Se!ing Basic Information
Use the following import convention: By Position >>> df.shape (rows,columns)
>>> import pandas as pd >>> df.iloc[[0],[0]] Select single value by row & >>> df.index Describe index
'Belgium' column >>> df.columns Describe DataFrame columns
Pandas Data Structures >>> df.info() Info on DataFrame
>>> df.iat([0],[0]) >>> df.count() Number of non-NA values
Series 'Belgium'
Summary
A one-dimensional labeled array a 3 By Label
>>> df.loc[[0], ['Country']] Select single value by row & >>> df.sum() Sum of values
capable of holding any data type b -5 >>> df.cumsum() Cummulative sum of values
'Belgium' column labels >>> df.min()/df.max() Minimum/maximum values
c 7 >>> df.at([0], ['Country']) >>> df.idxmin()/df.idxmax() Minimum/Maximum index value
Index
d 4 'Belgium' >>> df.describe() Summary statistics
>>> df.mean() Mean of values
By Label/Position >>> df.median() Median of values
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
>>> df.ix[2] Select single row of
DataFrame Country Brazil subset of rows
Capital Brasília
Applying Functions
Population 207847528 >>> f = lambda x: x*2
Columns
Country Capital Population A two-dimensional labeled >>> df.ix[:,'Capital'] Select a single column of >>> df.apply(f) Apply function
>>> df.applymap(f) Apply function element-wise
data structure with columns 0 Brussels subset of columns
0 Belgium Brussels 11190846 1 New Delhi
of potentially different types 2 Brasília Data Alignment
1 India New Delhi 1303171035
Index >>> df.ix[1,'Capital'] Select rows and columns
2 Brazil Brasília 207847528 Internal Data Alignment
'New Delhi'
NA values are introduced in the indices that don’t overlap:
Boolean Indexing
>>> data = {'Country': ['Belgium', 'India', 'Brazil'], >>> s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
>>> s[~(s > 1)] Series s where value is not >1
'Capital': ['Brussels', 'New Delhi', 'Brasília'], >>> s[(s < -1) | (s > 2)] s where value is <-1 or >2 >>> s + s3
'Population': [11190846, 1303171035, 207847528]} >>> df[df['Population']>1200000000] Use filter to adjust DataFrame a 10.0
b NaN
>>> df = pd.DataFrame(data, Se!ing
c 5.0
columns=['Country', 'Capital', 'Population']) >>> s['a'] = 6 Set index a of Series s to 6
d 7.0
>>> df.iterrows() (Row-index, Series) pairs >>> df2.replace("a", "f") Replace values with others
DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Plot Anatomy & Workflow
Plot Anatomy Workflow
Matplotlib Axes/Subplot The basic steps to creating plots with matplotlib are:
Learn Python Interactively at www.DataCamp.com 1 Prepare data 2 Create plot 3 Plot 4 Customize plot 5 Save plot 6 Show plot
>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() Step 2
Matplotlib Y-axis Figure >>> ax = fig.add_subplot(111) Step 3
>>> ax.plot(x, y, color='lightblue', linewidth=3) Step 3, 4
Matplotlib is a Python 2D plo!ing library which produces >>> ax.scatter([2,4,6],
publication-quality figures in a variety of hardcopy formats [5,15,25],
color='darkgreen',
and interactive environments across marker='^')
platforms. >>> ax.set_xlim(1, 6.5)
X-axis
>>> plt.savefig('foo.png')
>>> plt.show() Step 6
1 Prepare The Data Also see Lists & NumPy
1D Data 4 Customize Plot
>>> import numpy as np Colors, Color Bars & Color Maps Mathtext
>>> x = np.linspace(0, 10, 100)
>>> y = np.cos(x) >>> plt.plot(x, x, x, x**2, x, x**3) >>> plt.title(r'$sigma_i=15$', fontsize=20)
>>> z = np.sin(x) >>> ax.plot(x, y, alpha = 0.4)
>>> ax.plot(x, y, c='k') Limits, Legends & Layouts
2D Data or Images >>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img, Limits & Autoscaling
>>> data = 2 * np.random.random((10, 10)) cmap='seismic')
>>> data2 = 3 * np.random.random((10, 10)) >>> ax.margins(x=0.0,y=0.1) Add padding to a plot
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j] >>> ax.axis('equal') Set the aspect ratio of the plot to 1
Markers >>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) Set limits for x-and y-axis
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2 >>> fig, ax = plt.subplots() >>> ax.set_xlim(0,10.5) Set limits for x-axis
>>> from matplotlib.cbook import get_sample_data >>> ax.scatter(x,y,marker=".") Legends
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy')) >>> ax.plot(x,y,marker="o") >>> ax.set(title='An Example Axes', Set a title and x-and y-axis labels
ylabel='Y-Axis',
Linestyles xlabel='X-Axis')
Create Plot >>> ax.legend(loc='best') No overlapping plot elements
2 >>> plt.plot(x,y,linewidth=4.0) Ticks
>>> import matplotlib.pyplot as plt >>> plt.plot(x,y,ls='solid') >>> ax.xaxis.set(ticks=range(1,5), Manually set x-ticks
>>> plt.plot(x,y,ls='--') ticklabels=[3,100,-12,"foo"])
Figure >>> plt.plot(x,y,'--',x**2,y**2,'-.') >>> ax.tick_params(axis='y', Make y-ticks longer and go in and out
>>> plt.setp(lines,color='r',linewidth=4.0) direction='inout',
>>> fig = plt.figure() length=10)
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0)) Text & Annotations
Subplot Spacing
Axes >>> ax.text(1, >>> fig3.subplots_adjust(wspace=0.5, Adjust the spacing between subplots
-2.1, hspace=0.3,
All plo!ing is done with respect to an Axes. In most cases, a 'Example Graph', left=0.125,
style='italic') right=0.9,
subplot will fit your needs. A subplot is an axes on a grid system. >>> ax.annotate("Sine", top=0.9,
>>> fig.add_axes() xy=(8, 0), bottom=0.1)
>>> ax1 = fig.add_subplot(221) # row-col-num xycoords='data', >>> fig.tight_layout() Fit subplot(s) in to the figure area
xytext=(10.5, 0),
>>> ax3 = fig.add_subplot(212) textcoords='data', Axis Spines
>>> fig3, axes = plt.subplots(nrows=2,ncols=2) arrowprops=dict(arrowstyle="->", >>> ax1.spines['top'].set_visible(False) Make the top axis line for a plot invisible
>>> fig4, axes2 = plt.subplots(ncols=3) connectionstyle="arc3"),) >>> ax1.spines['bottom'].set_position(('outward',10)) Move the bo!om axis line outward
vmin=-2, DataCamp
vmax=2) >>> axes2[2]= ax.clabel(CS) Label a contour plot
Learn Python for Data Science Interactively
Matplotlib 2.0.0 - Updated on: 02/2017
Python For Data Science Cheat Sheet 3 Plo!ing With Seaborn
Seaborn Axis Grids
Learn Data Science Interactively at www.DataCamp.com >>> g = sns.FacetGrid(titanic, Subplot grid for plo!ing conditional >>> h = sns.PairGrid(iris) Subplot grid for plo!ing pairwise
col="survived", relationships >>> h = h.map(plt.scatter) relationships
row="sex") >>> sns.pairplot(iris) Plot pairwise bivariate distributions
>>> g = g.map(plt.hist,"age") >>> i = sns.JointGrid(x="x", Grid for bivariate plot with marginal
>>> sns.factorplot(x="pclass", Draw a categorical plot onto a y="y", univariate plots
y="survived", Facetgrid data=data)
hue="sex", >>> i = i.plot(sns.regplot,
Statistical Data Visualization With Seaborn data=titanic) sns.distplot)
The Python visualization library Seaborn is based on >>> sns.lmplot(x="sepal_width", Plot data and regression model fits >>> sns.jointplot("sepal_length", Plot bivariate distribution
y="sepal_length", across a FacetGrid "sepal_width",
matplotlib and provides a high-level interface for drawing hue="species", data=iris,
a!ractive statistical graphics. data=iris) kind='kde')
with to temporarily set the style >>> sns.set_palette(flatui) Set your own color pale!e DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet 3 Renderers & Visual Customizations
Bokeh Glyphs Grid Layout
Learn Bokeh Interactively at www.DataCamp.com, Sca!er Markers >>> from bokeh.layouts import gridplot
taught by Bryan Van de Ven, core contributor >>> p1.circle(np.array([1,2,3]), np.array([3,2,1]), >>> row1 = [p1,p2]
fill_color='white') >>> row2 = [p3]
>>> p2.square(np.array([1.5,3.5,5.5]), [1,4,3], >>> layout = gridplot([[p1,p2],[p3]])
color='blue', size=1)
Line Glyphs Tabbed Layout
Plo!ing With Bokeh >>> p1.line([1,2,3,4], [3,4,5,6], line_width=2)
>>> p2.multi_line(pd.DataFrame([[1,2,3],[5,6,7]]), >>> from bokeh.models.widgets import Panel, Tabs
The Python interactive visualization library Bokeh >>> tab1 = Panel(child=p1, title="tab1")
pd.DataFrame([[3,4,5],[3,2,1]]),
enables high-performance visual presentation of color="blue") >>> tab2 = Panel(child=p2, title="tab2")
>>> layout = Tabs(tabs=[tab1, tab2])
large datasets in modern web browsers.
Customized Glyphs Also see Data
Linked Plots
Bokeh’s mid-level general purpose bokeh.plotting Selection and Non-Selection Glyphs
>>> p = figure(tools='box_select') Linked Axes
interface is centered around two main components: data >>> p.circle('mpg', 'cyl', source=cds_df, >>> p2.x_range = p1.x_range
and glyphs. selection_color='red', >>> p2.y_range = p1.y_range
nonselection_alpha=0.1) Linked Brushing
>>> p4 = figure(plot_width = 100,
+ = Hover Glyphs tools='box_select,lasso_select')
>>> from bokeh.models import HoverTool
>>> p4.circle('mpg', 'cyl', source=cds_df)
data glyphs plot >>> hover = HoverTool(tooltips=None, mode='vline')
>>> p5 = figure(plot_width = 200,
>>> p3.add_tools(hover)
tools='box_select,lasso_select')
The basic steps to creating plots with the bokeh.plotting >>> p5.circle('mpg', 'hp', source=cds_df)
interface are: >>> layout = row(p4,p5)
US
Asia
Europe
Colormapping
1. Prepare some data: >>> from bokeh.models import CategoricalColorMapper
Python lists, NumPy arrays, Pandas DataFrames and other sequences of values >>> color_mapper = CategoricalColorMapper(
factors=['US', 'Asia', 'Europe'], 4 Output & Export
2. Create a new plot palette=['blue', 'red', 'green'])
3. Add renderers for your data, with visual customizations >>> p3.circle('mpg', 'cyl', source=cds_df, Notebook
color=dict(field='origin',
4. Specify where to generate the output transform=color_mapper), >>> from bokeh.io import output_notebook, show
5. Show or save the results legend='Origin') >>> output_notebook()
>>> from bokeh.plotting import figure
>>> from bokeh.io import output_file, show Legend Location HTML
>>> x = [1, 2, 3, 4, 5] Step 1
>>> y = [6, 7, 2, 4, 5] Inside Plot Area Standalone HTML
Step 2 >>> p.legend.location = 'bottom_left' >>> from bokeh.embed import file_html
>>> p = figure(title="simple line example",
>>> from bokeh.resources import CDN
x_axis_label='x',
>>> html = file_html(p, CDN, "my_plot")
y_axis_label='y') Outside Plot Area
>>> p.line(x, y, legend="Temp.", line_width=2) Step 3 >>> from bokeh.models import Legend
>>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1]) >>> from bokeh.io import output_file, show
>>> output_file("lines.html") Step 4 >>> r2 = p2.line([1,2,3,4], [3,4,5,6]) >>> output_file('my_bar_chart.html', mode='cdn')
>>> show(p) Step 5 >>> legend = Legend(items=[("One" ,[p1, r1]),("Two",[r2])],
location=(0, -30)) Components
>>> p.add_layout(legend, 'right')
>>> from bokeh.embed import components
1 Data Also see Lists, NumPy & Pandas >>> script, div = components(p)
Under the hood, your data is converted to Column Data Legend Orientation
Sources. You can also do this manually: >>> p.legend.orientation = "horizontal" PNG
>>> import numpy as np >>> p.legend.orientation = "vertical"
>>> from bokeh.io import export_png
>>> import pandas as pd >>> export_png(p, filename="plot.png")
>>> df = pd.DataFrame(np.array([[33.9,4,65, 'US'], Legend Background & Border
[32.4,4,66, 'Asia'],
[21.4,4,109, 'Europe']]), >>> p.legend.border_line_color = "navy" SVG
columns=['mpg','cyl', 'hp', 'origin'], >>> p.legend.background_fill_color = "white"
index=['Toyota', 'Fiat', 'Volvo']) >>> from bokeh.io import export_svgs
>>> from bokeh.models import ColumnDataSource Rows & Columns Layout >>> p.output_backend = "svg"
>>> export_svgs(p, filename="plot.svg")
>>> cds_df = ColumnDataSource(df) Rows
>>> from bokeh.layouts import row
>>> layout = row(p1,p2,p3)
2 Plo!ing 5 Show or Save Your Plots
Columns
>>> from bokeh.plotting import figure >>> from bokeh.layouts import columns >>> show(p1) >>> show(layout)
>>> p1 = figure(plot_width=300, tools='pan,box_zoom') >>> layout = column(p1,p2,p3) >>> save(p1) >>> save(layout)
>>> p2 = figure(plot_width=300, plot_height=300, Nesting Rows & Columns
28
>>> misc.derivative(myfunc,1.0) Find the n-th derivative of a function at a point >>> help(scipy.linalg.diagsvd)
>>> np.info(np.matrix) Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Duplicate Values GroupBy
>>> df = df.dropDuplicates() >>> df.groupBy("age")\ Group by age, count the members
PySpark - SQL Basics .count() \ in the groups
.show()
Learn Python for data science Interactively at www.DataCamp.com Queries
>>> from pyspark.sql import functions as F Filter
Select
>>> df.select("firstName").show() Show all entries in firstName column >>> df.filter(df["age"]>24).show() Filter entries of age, only keep those
>>> df.select("firstName","lastName") \ records of which the values are >24
PySpark & Spark SQL .show()
>>> df.select("firstName", Show all entries in firstName, age
Spark SQL is Apache Spark's module for "age", and type
explode("phoneNumber") \ Sort
working with structured data. .alias("contactInfo")) \
.select("contactInfo.type", >>> peopledf.sort(peopledf.age.desc()).collect()
"firstName", >>> df.sort("age", ascending=False).collect()
Initializing SparkSession "age") \ >>> df.orderBy(["age","city"],ascending=[0,1])\
A SparkSession can be used create DataFrame, register DataFrame as tables, .show() .collect()
execute SQL over tables, cache tables, and read parquet files. >>> df.select(df["firstName"],df["age"]+ 1) Show all entries in firstName and age,
.show() add 1 to the entries of age
>>> from pyspark.sql import SparkSession >>> df.select(df['age'] > 24).show() Show all entries where age >24
>>> spark = SparkSession \ When
Missing & Replacing Values
.builder \ >>> df.select("firstName", Show firstName and 0 or 1 depending
.appName("Python Spark SQL basic example") \ >>> df.na.fill(50).show() Replace null values
F.when(df.age > 30, 1) \ on age >30 >>> df.na.drop().show() Return new df omi!ing rows with null values
.config("spark.some.config.option", "some-value") \ .otherwise(0)) \
.getOrCreate() >>> df.na \ Return new df replacing one value with
.show() .replace(10, 20) \ another
>>> df[df.firstName.isin("Jane","Boris")] Show firstName if in the given options .show()
.collect()
Creating DataFrames Like
>>> df.select("firstName", Show firstName, and lastName is Repartitioning
From RDDs df.lastName.like("Smith")) \ TRUE if lastName is like Smith
.show()
>>> from pyspark.sql.types import * Startswith - Endswith >>> df.repartition(10)\ df with 10 partitions
>>> df.select("firstName", Show firstName, and TRUE if .rdd \
Infer Schema .getNumPartitions()
>>> sc = spark.sparkContext df.lastName \ lastName starts with Sm
.startswith("Sm")) \ >>> df.coalesce(1).rdd.getNumPartitions() df with 1 partition
>>> lines = sc.textFile("people.txt")
.show()
>>> parts = lines.map(lambda l: l.split(",")) >>> df.select(df.lastName.endswith("th")) \ Show last names ending in th
>>> people = parts.map(lambda p: Row(name=p[0],age=int(p[1]))) .show() Running SQL Queries Programmatically
>>> peopledf = spark.createDataFrame(people) Substring
Specify Schema >>> df.select(df.firstName.substr(1, 3) \ Return substrings of firstName Registering DataFrames as Views
>>> people = parts.map(lambda p: Row(name=p[0], .alias("name")) \
age=int(p[1].strip()))) .collect() >>> peopledf.createGlobalTempView("people")
>>> schemaString = "name age" Between >>> df.createTempView("customer")
>>> fields = [StructField(field_name, StringType(), True) for >>> df.select(df.age.between(22, 24)) \ Show age: values are TRUE if between >>> df.createOrReplaceTempView("customer")
field_name in schemaString.split()] .show() 22 and 24
>>> schema = StructType(fields) Query Views
>>> spark.createDataFrame(people, schema).show()
+--------+---+ Add, Update & Remove Columns
| name|age| >>> df5 = spark.sql("SELECT * FROM customer").show()
+--------+---+ >>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")\
| Mine| 28| Adding Columns
| Filip| 29| .show()
|Jonathan| 30|
+--------+---+ >>> df = df.withColumn('city',df.address.city) \
.withColumn('postalCode',df.address.postalCode) \
.withColumn('state',df.address.state) \ Output
From Spark Data Sources .withColumn('streetAddress',df.address.streetAddress) \
.withColumn('telePhoneNumber', Data Structures
JSON explode(df.phoneNumber.number)) \
>>> df = spark.read.json("customer.json") .withColumn('telePhoneType',
>>> df.show() >>> rdd1 = df.rdd Convert df into an RDD
+--------------------+---+---------+--------+--------------------+ explode(df.phoneNumber.type)) >>> df.toJSON().first() Convert df into a RDD of string
| address|age|firstName |lastName| phoneNumber|
+--------------------+---+---------+--------+--------------------+ >>> df.toPandas() Return the contents of df as Pandas
|[New York,10021,N...| 25| John| Smith|[[212 555-1234,ho...| Updating Columns DataFrame
|[New York,10021,N...| 21| Jane| Doe|[[322 888-1234,ho...|
+--------------------+---+---------+--------+--------------------+
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber') Write & Save to Files
>>> df2 = spark.read.load("people.json", format="json")
Parquet files Removing Columns >>> df.select("firstName", "city")\
>>> df3 = spark.read.load("users.parquet") .write \
TXT files >>> df = df.drop("address", "phoneNumber") .save("nameAndCity.parquet")
>>> df4 = spark.read.text("people.txt") >>> df = df.drop(df.address).drop(df.phoneNumber) >>> df.select("firstName", "age") \
.write \
.save("namesAndAges.json",format="json")
Inspect Data
>>> df.dtypes Return df column names and data types >>> df.describe().show() Compute summary statistics Stopping SparkSession
>>> df.show() Display the content of df >>> df.columns Return the columns of df
>>> df.count() Count the number of rows in df >>> spark.stop()
>>> df.head() Return first n rows
>>> df.first() Return first row >>> df.distinct().count() Count the number of distinct rows in df
>>> df.printSchema() Print the schema of df
30
('b', 2) DataCamp
>>> textFile2 = sc.wholeTextFiles("/my/directory/") ('a', 2) Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Create Your Model Evaluate Your Model’s Performance
Supervised Learning Estimators Classification Metrics
Scikit-Learn
Learn Python for data science Interactively at www.DataCamp.com Linear Regression Accuracy Score
>>> from sklearn.linear_model import LinearRegression >>> knn.score(X_test, y_test) Estimator score method
>>> lr = LinearRegression(normalize=True) >>> from sklearn.metrics import accuracy_score Metric scoring functions
>>> accuracy_score(y_test, y_pred)
Support Vector Machines (SVM)
Scikit-learn >>> from sklearn.svm import SVC Classification Report
>>> svc = SVC(kernel='linear') >>> from sklearn.metrics import classification_report Precision, recall, f1-score
Scikit-learn is an open source Python library that >>> print(classification_report(y_test, y_pred)) and support
Naive Bayes
implements a range of machine learning, >>> from sklearn.naive_bayes import GaussianNB Confusion Matrix
>>> gnb = GaussianNB() >>> from sklearn.metrics import confusion_matrix
preprocessing, cross-validation and visualization >>> print(confusion_matrix(y_test, y_pred))
algorithms using a unified interface. KNN
>>> from sklearn import neighbors Regression Metrics
A Basic Example >>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> from sklearn import neighbors, datasets, preprocessing
Mean Absolute Error
>>> from sklearn.model_selection import train_test_split Unsupervised Learning Estimators >>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.metrics import accuracy_score >>> y_true = [3, -0.5, 2]
>>> iris = datasets.load_iris() Principal Component Analysis (PCA) >>> mean_absolute_error(y_true, y_pred)
>>> X, y = iris.data[:, :2], iris.target >>> from sklearn.decomposition import PCA Mean Squared Error
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33) >>> pca = PCA(n_components=0.95) >>> from sklearn.metrics import mean_squared_error
>>> scaler = preprocessing.StandardScaler().fit(X_train) >>> mean_squared_error(y_test, y_pred)
>>> X_train = scaler.transform(X_train)
K Means
>>> X_test = scaler.transform(X_test) >>> from sklearn.cluster import KMeans R² Score
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5) >>> k_means = KMeans(n_clusters=3, random_state=0) >>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred) Model Fi!ing Clustering Metrics
Adjusted Rand Index
Supervised learning >>> from sklearn.metrics import adjusted_rand_score
Also see NumPy & Pandas >>> lr.fit(X, y) Fit the model to the data
Loading The Data >>> adjusted_rand_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse >>> svc.fit(X_train, y_train) Homogeneity
>>> from sklearn.metrics import homogeneity_score
matrices. Other types that are convertible to numeric arrays, such as Pandas Unsupervised Learning >>> homogeneity_score(y_true, y_pred)
DataFrame, are also acceptable. >>> k_means.fit(X_train) Fit the model to the data
>>> pca_model = pca.fit_transform(X_train) Fit to data, then transform it V-measure
>>> import numpy as np >>> from sklearn.metrics import v_measure_score
>>> X = np.random.random((10,5)) >>> metrics.v_measure_score(y_true, y_pred)
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F','F'])
>>> X[X < 0.7] = 0 Prediction Cross-Validation
>>> from sklearn.cross_validation import cross_val_score
Supervised Estimators >>> print(cross_val_score(knn, X_train, y_train, cv=4))
Training And Test Data >>> y_pred = svc.predict(np.random.random((2,5))) Predict labels >>> print(cross_val_score(lr, X, y, cv=2))
>>> y_pred = lr.predict(X_test) Predict labels
>>> from sklearn.model_selection import train_test_split >>> y_pred = knn.predict_proba(X_test) Estimate probability of a label
>>> X_train, X_test, y_train, y_test = train_test_split(X, Tune Your Model
y, Unsupervised Estimators
random_state=0) >>> y_pred = k_means.predict(X_test) Predict labels in clustering algos Grid Search
>>> from sklearn.grid_search import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3),
Preprocessing The Data "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,
Standardization Encoding Categorical Features param_grid=params)
>>> grid.fit(X_train, y_train)
>>> from sklearn.preprocessing import StandardScaler >>> from sklearn.preprocessing import LabelEncoder >>> print(grid.best_score_)
>>> scaler = StandardScaler().fit(X_train) >>> print(grid.best_estimator_.n_neighbors)
>>> enc = LabelEncoder()
>>> standardized_X = scaler.transform(X_train) >>> y = enc.fit_transform(y)
>>> standardized_X_test = scaler.transform(X_test) Randomized Parameter Optimization
Normalization Imputing Missing Values >>> from sklearn.grid_search import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
>>> from sklearn.preprocessing import Normalizer "weights": ["uniform", "distance"]}
>>> from sklearn.preprocessing import Imputer >>> rsearch = RandomizedSearchCV(estimator=knn,
>>> scaler = Normalizer().fit(X_train) >>> imp = Imputer(missing_values=0, strategy='mean', axis=0) param_distributions=params,
>>> normalized_X = scaler.transform(X_train) >>> imp.fit_transform(X_train) cv=4,
>>> normalized_X_test = scaler.transform(X_test) n_iter=8,
random_state=5)
Binarization Generating Polynomial Features >>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
>>> from sklearn.preprocessing import Binarizer >>> from sklearn.preprocessing import PolynomialFeatures
>>> binarizer = Binarizer(threshold=0.0).fit(X) >>> poly = PolynomialFeatures(5)
32
Span indices are exclusive. So doc[2:4] is a span starting at doc = nlp("This a sentence. This is another one.")
token 2, up to – but not including! – token 4. # doc.sents is a generator that yields sentence spans
About spaCy
[sent.text for sent in doc.sents]
spaCy is a free, open-source library for advanced Natural doc = nlp("This is a text") # ['This is a sentence.', 'This is another one.']
Language Processing (NLP) in Python. It's designed span = doc[2:4]
specifically for production use and helps you build span.text
applications that process and "understand" large volumes Base noun phrases NEEDS THE TAGGER AND PARSER
# 'a text'
of text. Documentation: spacy.io
doc = nlp("I have a red car")
Creating a span manually
# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]
$ pip install spacy # Import the Span object
# ['I', 'a red car']
from spacy.tokens import Span
# Create a Doc object
import spacy
doc = nlp("I live in New York")
# Span for "New York" with label GPE (geopolitical)
span = Span(doc, 3, 5, label="GPE") Label explanations
Statistical models span.text
# 'New York' spacy.explain("RB")
# 'adverb'
Download statistical models spacy.explain("GPE")
Predict part-of-speech tags, dependency labels, named # 'Countries, cities, states'
entities and more. See here for available models: Linguistic features
spacy.io/models
Attributes return label IDs. For string labels, use the
$ python -m spacy download en_core_web_sm attributes with an underscore. For example, token.pos_ . Visualizing
Check that your installed models are up to date If you're in a Jupyter notebook, use displacy.render .
Part-of-speech tags PREDICTED BY STATISTICAL MODEL Otherwise, use displacy.serve to start a web server and
$ python -m spacy validate show the visualization in your browser.
doc = nlp("This is a text.")
Loading statistical models # Coarse-grained part-of-speech tags from spacy import displacy
[token.pos_ for token in doc]
import spacy # ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']
# Load the installed model "en_core_web_sm" # Fine-grained part-of-speech tags Visualize dependencies
nlp = spacy.load("en_core_web_sm") [token.tag_ for token in doc]
# ['DT', 'VBZ', 'DT', 'NN', '.'] doc = nlp("This is a sentence")
displacy.render(doc, style="dep")
Text tokenizer tagger parser ner ... Doc doc[3:5].has_label("GPE") Part-of-speech (POS) Assigning word types to tokens like verb or noun.
# True Tagging
f
orvari
abl
es
,fun
cti
ons
, Identiers type(e xp
res si
on ) Conversions
modu
les
,cl
ass
es…names
int("15") → 15
int("3f",16) → 63 nd
can specify integer number base in 2 parameter
a…zA…Z_followed by a…zA…Z_0…9
◽ diacritics allowed but should be avoided int(15.56) → 15 truncate decimal part
◽ language keywords forbidden float("-11.24e8") → -1124000000.0
◽ lower/UPPER case discrimination round(15.56,1)→ 15.6 rounding to 1 decimal (0 decimal → integer number)
a toto x7 y_max BigOne bool(x) Falsefor null x, empty container x, Noneor Falsex ; Truefor other x
⇥ 8y and for str(x)→ "…" representation string of xfor display ( cf.f orma tt
in gont
heb a
ck)
chr(64)→'@' ord('@')→64 code ↔ char
= Variables assignment
repr(x)→ "…" l it
er
a l
representation string of x
☝ assignment ⇔ binding of a namewith a v
alu
e
1)e va l
uati
onofr i
ghtsideex pre
ssio
nva l
u e bytes([72,9,64]) → b'H\t@'
2)a ssignmenti
no rderwithl eft
sidena
me s list("abc") → ['a','b','c']
x=1.2+8+sin(y) dict([(3,"three"),(1,"one")]) → {1:'one',3:'three'}
a=b=c=0 as s
ignme ntt
osamev a
lue set(["one","two"]) → {'one','two'}
y,z,r=9.2,-7.6,0 mul t
ipl
eass
ign
ments separat
orstrand s equenceo fstr→ a ssemb ledstr
a,b=b,a valu esswap ':'.join(['toto','12','pswd']) → 'toto:12:pswd'
a,*b=seq unpacki ngofseq
uencei
n strs plit
tedo nwh it
espaces→ listof str
*a,b=seq i tema n
dl i
st "words with spaces".split() → ['words','with','spaces']
and strs plit
tedo ns eparatorstr→ listo fstr
x+=3 i ncre
me nt% x=x+3 *=
x-=2 decr eme nt% x=x-2 /= "1,4,8,2".split(",") → ['1','4','8','2']
x=None «undefin
e d»const
antval
ue %= sequence of one type → listof another type ( vi
al istcomp rehe nsion )
del x r
emo
ven
amex … [int(x) for x in ('1','29','-3')] → [1,29,-3]
f
orl
i
sts
,tu
ples
,st
ri
ngs
,by
te s
… Sequence Containers Indexing
n
egat
iv
ein
dex -5 -4 -3 -2 -1 Items count Individual access to items via lst[i
nde
x]
pos
it
iv
einde
x 0 1 2 3 4 len(lst)→5 lst[0]→10 - fir
stone lst[1]→20
lst=[10, 20, 30, 40, 50] lst[-1]→50 -l asto
ne lst[-2]→40
pos
it
iv
esl
ic
e 0 1 2 3 4 5 ☝ index from 0
Onmutab
les
equ
ence
s(list)
,re
movewit
h
n
egat
iv
esl
i
ce -5 -4 -3 -2 -1 (here from 0 to 4)
del lst[3]andmodi
fywi
thass
ign
ment
lst[4]=25
Access to sub-sequences via lst[s
tar
tsl
i
ce:e
nds
li
ce:s
te
p]
lst[:-1]C[10,20,30,40] lst[::-1]C[50,40,30,20,10] lst[1:3]C[20,30] lst[:3]C[10,20,30]
lst[1:-1]C[20,30,40] lst[::-2]C[50,30,10] lst[-3:-1]C[30,40] lst[3:]C[40,50]
lst[::2]C[10,30,50] lst[:]C[10,20,30,40,50]sh
all
owco
pyofs
equ
enc
e
Mis
si
ngsl
icei
ndi
cat
ion" fr
oms t
art
/upt
oen
d.
Onmuta
blese
quenc
es(list),r
emovewi
t
hdel lst[3:5]a
ndmo
dif
ywi
t
has
si
gnme
ntlst[1:4]=[15,25]
-
neo
usl
y H ☝ modules and packages searched in p
yth
onp
ath(cf sys.path)
a or b logical or oneorot
her p
arent
stat
ement
: s
tat
ement
blo
ckexec
ute
don
ly
orb o
th Conditional Statement
s
tat
ementbl
ock2… i
fac o
ndi
ti
onist
rue
☝ pitfall : anda ndorre
tur
nv al
ueofao
r ⌅ yes no yes
ofb( un de rsho
rtc
ute
val
uat
ion)
. if l
ogi
calc
ondi
ti
on: ? ?
-e n su ret ha
taa ndbareboo
leans
. no
not a logical not
n
ext
sta
teme
nta
fte
rbl
ock1 s
tat
ement
sbl
ock
True Can go with several el i
f,el
if... and only one
True and False constants ☝config
uree
dit
ort
oinse
rt4sp
ace
sin if age<=18:
False 8nal else
. Only the block of 8rst true
pl
aceo fani
ndent
at
io
ntab. state="Kid"
condition is executed. elif age>65:
☝ flo
ati
ngn
umb
ers
…ap
pro
xima
tedv
alu
es a
ngl
esi
nra
dia
ns Maths ☝ wi
t
havarx: state="Retired"
if bool(x)==True: ⇔ if x: else:
Operators: + - * / // % ** from math import sin,pi… state="Active"
if bool(x)==False:⇔ if not x:
Priority (…) × ÷ ab sin(pi/4)C0.707…
integer ÷ ÷ remainder cos(2*pi/3)C-0.4999… Exceptions on Errors
Signaling an error:
@→ matrix × pyt
hon3.
5+numpy sqrt(81)C9.0 N raise Ex c Cl
ass(
…) err
or
(1+5.3)*2C12.6 log(e**2)C2.0 Errors processing: n
orma
l proc
ess
ing
abs(-3.2)C3.2 ceil(12.5)C13 try: raiseX(
) e
rro
rraise
p
roc
ess
ing p
roce
ssi
ng
round(3.57,1)C3.6 floor(12.5)C12 nor ma lproc esi
si
ngblo
ck
pow(4,3)C64.0 modules math, statistics, random, except Ex c
ep t
i on as e: ☝ finallybl
ockf
orfin
alp
roc
ess
ing
☝usu alo rdero fop er
ati
ons decimal, fractions, numpy, etc.(c
f.do
c) errorp roces si
ngblo
ck inallc
ase
s.
37
!
s s
tat
ement
sbl
ockex
ecu
tedasl
ongas Conditional Loop Statement s
tat
ement
sbl
ockexe
cut
edfo
re h Iterative Loop Statement
ac
c
ondit
i
onist
rue i
t
emo facon
tai
nerori
te
rat
or
op
yes next
for var in s
eque
nce
o
while l
ogi
calco
ndit
i
on: :
el
? Loop Control …
t
no 8nish
fini
s
tat
ement
sblo
ck break i
mmediat
eexi
t st
at
emen
tsb
loc
k
continue n
exti
ter
ati
on
fin
s=0 i
ni
ti
al
iz
ati
onsbe
for
ethel
oop ☝ elseblo
ckf
orno
rmal Go over sequence's values
eo
i=1 c loopex
it
. s = "Some text" i ni
ti
ali
za t
i
onsbe
for
etheloop
ondi
ti
onwithal
east
oneva
ria
blev
alu
e(h
erei)
r
wa
Al
go: cnt = 0
s= ∑ i 2
☝b
s = s + i**2 for c in s:
i=i+1 ☝ ma
kec
ondi
t
ionv
ari
abl
ech
ang
e! if c == "e": Algo:cou
nt
print("sum:",s) i=1 cnt = cnt + 1 numb e
rofe
print("found",cnt,"'e'") inth
es t
ri
ng.
print("v=",3,"cmO:",x,",",y+4) Display loop on dict/set ⇔ loop on keys sequences
use sl
icesto loop on a subset of a sequence
f.readline() "n e
x t
li
ne nom "{1:>10s}".format(8,"toto")
☝te
xtmodetb yde fault(rea d/writestr) ,p ossibl
eb i
nar
y 0.nom →' toto'
4[key] "{x!r}".format(x="I'm")
modeb( read/wr i
tebytes) .Co n ver
tfrom/ tor equi r
edtype! 0[2]
f.close() ☝ dont forget to close the le after use ! →'"I\'m"'
◽ Formatting :
f.flush()write cache f.truncate([ size ]
) resize fil
lch ara lig n
men
tsi
gn mi
niwi
dth
.pre
cis
ion~ma
xwi
dt
h type
re
adi
ng/
wri
t
ingpr
ogressseq u
e nti
a l
lyi nt hefile,modifiablewi th:
<>^= +-s pac
e 0at start for 8lling with 0
f.tell()"p
osi
t
ion f.seek(p
osi
t
ion[
,or
igi
n]) integer: bbinary, cchar, ddecimal (default), ooctal, xor Xhexa…
Very common: opening with a guarded block with open(…) as f: Moat: eor Eexponential, for F8xed point, gor Gappropriate (default),
(automatic closing) and reading loop on lines for line in fO: string: s… %percent
of a text 8le: #p r
oces
si
ngo
fline ◽ Conversion : s(readable text) or r(literal representation)
Dictionaries store connections between pieces of
List comprehensions information. Each item in a dictionary is a key-value pair.
squares = [x**2 for x in range(1, 11)] A simple dictionary
Slicing a list alien = {'color': 'green', 'points': 5}
finishers = ['sam', 'bob', 'ada', 'bea']
Accessing a value
first_two = finishers[:2]
print("The alien's color is " + alien['color'])
Variables are used to store values. A string is a series of Copying a list
characters, surrounded by single or double quotes. Adding a new key-value pair
copy_of_bikes = bikes[:]
Hello world alien['x_position'] = 0
print("Hello world!") Looping through all key-value pairs
Hello world with a variable Tuples are similar to lists, but the items in a tuple can't be fav_numbers = {'eric': 17, 'ever': 4}
modified. for name, number in fav_numbers.items():
msg = "Hello world!"
print(name + ' loves ' + str(number))
print(msg) Making a tuple
Concatenation (combining strings) dimensions = (1920, 1080)
Looping through all keys
fav_numbers = {'eric': 17, 'ever': 4}
first_name = 'albert'
for name in fav_numbers.keys():
last_name = 'einstein'
print(name + ' loves a number')
full_name = first_name + ' ' + last_name If statements are used to test for particular conditions and
print(full_name) respond appropriately. Looping through all the values
Conditional tests fav_numbers = {'eric': 17, 'ever': 4}
for number in fav_numbers.values():
equals x == 42 print(str(number) + ' is a favorite')
A list stores a series of items in a particular order. You
not equal x != 42
access items using an index, or within a loop.
greater than x > 42
Make a list or equal to x >= 42
less than x < 42 Your programs can prompt the user for input. All input is
bikes = ['trek', 'redline', 'giant'] or equal to x <= 42 stored as a string.
Get the first item in a list Conditional test with lists Prompting for a value
first_bike = bikes[0] 'trek' in bikes name = input("What's your name? ")
Get the last item in a list 'surly' not in bikes print("Hello, " + name + "!")
last_bike = bikes[-1] Assigning boolean values Prompting for numerical input
Looping through a list game_active = True age = input("How old are you? ")
can_edit = False age = int(age)
for bike in bikes:
print(bike) A simple if test
pi = input("What's the value of pi? ")
Adding items to a list if age >= 18: pi = float(pi)
print("You can vote!")
bikes = []
bikes.append('trek') If-elif-else statements
bikes.append('redline') if age < 4:
bikes.append('giant') ticket_price = 0
Making numerical lists elif age < 18: Covers Python 3 and Python 2
ticket_price = 10
squares = [] else:
for x in range(1, 11): ticket_price = 15
squares.append(x**2)
38
A while loop repeats a block of code as long as a certain A class defines the behavior of an object and the kind of Your programs can read from files and write to files. Files
condition is true. information an object can store. The information in a class are opened in read mode ('r') by default, but can also be
is stored in attributes, and functions that belong to a class opened in write mode ('w') and append mode ('a').
A simple while loop are called methods. A child class inherits the attributes and
methods from its parent class. Reading a file and storing its lines
current_value = 1
while current_value <= 5: filename = 'siddhartha.txt'
Creating a dog class
print(current_value) with open(filename) as file_object:
current_value += 1 class Dog(): lines = file_object.readlines()
"""Represent a dog."""
Letting the user choose when to quit for line in lines:
msg = '' def __init__(self, name): print(line)
while msg != 'quit': """Initialize dog object."""
self.name = name Writing to a file
msg = input("What's your message? ")
print(msg) filename = 'journal.txt'
def sit(self): with open(filename, 'w') as file_object:
"""Simulate sitting.""" file_object.write("I love programming.")
print(self.name + " is sitting.")
Functions are named blocks of code, designed to do one Appending to a file
specific job. Information passed to a function is called an my_dog = Dog('Peso')
argument, and information received by a function is called a filename = 'journal.txt'
parameter. with open(filename, 'a') as file_object:
print(my_dog.name + " is a great dog!") file_object.write("\nI love making games.")
A simple function my_dog.sit()
make_pizza()
make_pizza('pepperoni')
If you had infinite programming skills, what would you Simple is better than complex
Returning a value build?
If you have a choice between a simple and a complex
def add_numbers(x, y): As you're learning to program, it's helpful to think solution, and both work, use the simple solution. Your
"""Add two numbers and return the sum.""" about the real-world projects you'd like to create. It's
return x + y code will be easier to maintain, and it will be easier
a good habit to keep an "ideas" notebook that you for you and others to build on that code later on.
can refer to whenever you want to start a new project.
sum = add_numbers(3, 5)
print(sum)
If you haven't done so already, take a few minutes
and describe three projects you'd like to create. More cheat sheets available at
39
You can add elements to the end of a list, or you can insert The sort() method changes the order of a list permanently.
them wherever you like in a list. The sorted() function returns a copy of the list, leaving the
original list unchanged. You can sort the items in a list in
Adding an element to the end of the list alphabetical order, or reverse alphabetical order. You can
users.append('amy') also reverse the original order of the list. Keep in mind that
lowercase and uppercase letters may affect the sort order.
Starting with an empty list
Sorting a list permanently
users = []
A list stores a series of items in a particular order. users.append('val') users.sort()
Lists allow you to store sets of information in one users.append('bob') Sorting a list permanently in reverse alphabetical
place, whether you have just a few items or millions users.append('mia')
order
of items. Lists are one of Python's most powerful Inserting elements at a particular position
features readily accessible to new programmers, and users.sort(reverse=True)
they tie together many important concepts in users.insert(0, 'joe')
Sorting a list temporarily
programming. users.insert(3, 'bea')
print(sorted(users))
print(sorted(users, reverse=True))
You can remove elements by their position in a list, or by Reversing the order of a list
Use square brackets to define a list, and use commas to
separate individual items in the list. Use plural names for the value of the item. If you remove an item by its value, users.reverse()
lists, to make your code easier to read. Python removes only the first item that has that value.
users = ['val', 'bob', 'mia', 'ron', 'ned'] del users[-1] Lists can contain millions of items, so Python provides an
efficient way to loop through all the items in a list. When
Removing an item by its value you set up a loop, Python pulls each item from the list one
users.remove('mia') at a time and stores it in a temporary variable, which you
Individual elements in a list are accessed according to their provide a name for. This name should be the singular
position, called the index. The index of the first element is version of the list name.
0, the index of the second element is 1, and so forth. The indented block of code makes up the body of the
Negative indices refer to items at the end of the list. To get If you want to work with an element that you're removing loop, where you can work with each individual item. Any
a particular element, write the name of the list and then the from the list, you can "pop" the element. If you think of the lines that are not indented run after the loop is completed.
index of the element in square brackets. list as a stack of items, pop() takes an item off the top of the
stack. By default pop() returns the last element in the list,
Printing all items in a list
Getting the first element but you can also pop elements from any position in the list. for user in users:
first_user = users[0] print(user)
Pop the last item from a list
Getting the second element most_recent_user = users.pop() Printing a message for each item, and a separate
print(most_recent_user) message afterwards
second_user = users[1]
for user in users:
Getting the last element Pop the first item in a list
print("Welcome, " + user + "!")
newest_user = users[-1] first_user = users.pop(0)
print(first_user) print("Welcome, we're glad to see you all!")
Use curly braces to define a dictionary. Use colons to Looping through all the keys
connect keys and values, and use commas to separate You can modify the value associated with any key in a # Show everyone who's taken the survey.
individual key-value pairs. dictionary. To do so give the name of the dictionary and for name in fav_languages.keys():
enclose the key in square brackets, then provide the new print(name)
Making a dictionary
value for that key.
alien_0 = {'color': 'green', 'points': 5} Looping through all the values
Modifying values in a dictionary
# Show all the languages that have been chosen.
alien_0 = {'color': 'green', 'points': 5} for language in fav_languages.values():
print(alien_0) print(language)
To access the value associated with an individual key give
the name of the dictionary and then place the key in a set of # Change the alien's color and point value. Looping through all the keys in order
square brackets. If the key you're asking for is not in the alien_0['color'] = 'yellow'
dictionary, an error will occur. # Show each person's favorite language,
alien_0['points'] = 10 # in order by the person's name.
You can also use the get() method, which returns None print(alien_0)
instead of an error if the key doesn't exist. You can also for name in sorted(fav_languages.keys()):
specify a default value to use if the key is not in the print(name + ": " + language)
dictionary.
Getting the value associated with a key You can remove any key-value pair you want from a
dictionary. To do so use the del keyword and the dictionary
You can find the number of key-value pairs in a dictionary.
alien_0 = {'color': 'green', 'points': 5} name, followed by the key in square brackets. This will
delete the key and its associated value. Finding a dictionary's length
print(alien_0['color'])
print(alien_0['points']) Deleting a key-value pair num_responses = len(fav_languages)
alien_0 = {'color': 'green', 'points': 5}
Getting the value with get()
print(alien_0)
alien_0 = {'color': 'green'}
del alien_0['points']
alien_color = alien_0.get('color') print(alien_0)
alien_points = alien_0.get('points', 0) Covers Python 3 and Python 2
print(alien_color)
print(alien_points) Try running some of these examples on pythontutor.com.
42
It's sometimes useful to store a set of dictionaries in a list; Storing a list inside a dictionary alows you to associate Standard Python dictionaries don't keep track of the order
this is called nesting. more than one value with each key. in which keys and values are added; they only preserve the
association between each key and its value. If you want to
Storing dictionaries in a list Storing lists in a dictionary preserve the order in which keys and values are added, use
# Start with an empty list. # Store multiple languages for each person. an OrderedDict.
users = [] fav_languages = { Preserving the order of keys and values
'jen': ['python', 'ruby'],
# Make a new user, and add them to the list. 'sarah': ['c'], from collections import OrderedDict
new_user = { 'edward': ['ruby', 'go'],
'last': 'fermi', 'phil': ['python', 'haskell'], # Store each person's languages, keeping
'first': 'enrico', } # track of who respoded first.
'username': 'efermi', fav_languages = OrderedDict()
} # Show all responses for each person.
users.append(new_user) for name, langs in fav_languages.items(): fav_languages['jen'] = ['python', 'ruby']
print(name + ": ") fav_languages['sarah'] = ['c']
# Make another new user, and add them as well. for lang in langs: fav_languages['edward'] = ['ruby', 'go']
new_user = { print("- " + lang) fav_languages['phil'] = ['python', 'haskell']
'last': 'curie',
'first': 'marie', # Display the results, in the same order they
'username': 'mcurie', # were entered.
} You can store a dictionary inside another dictionary. In this for name, langs in fav_languages.items():
users.append(new_user) case each value associated with a key is itself a dictionary. print(name + ":")
for lang in langs:
Storing dictionaries in a dictionary print("- " + lang)
# Show all information about each user.
for user_dict in users: users = {
for k, v in user_dict.items(): 'aeinstein': {
print(k + ": " + v) 'first': 'albert',
'last': 'einstein', You can use a loop to generate a large number of
print("\n")
'location': 'princeton', dictionaries efficiently, if all the dictionaries start out with
You can also define a list of dictionaries directly, }, similar data.
without using append(): 'mcurie': { A million aliens
# Define a list of users, where each user 'first': 'marie',
'last': 'curie', aliens = []
# is represented by a dictionary.
users = [ 'location': 'paris',
}, # Make a million green aliens, worth 5 points
{ # each. Have them all start in one row.
'last': 'fermi', }
for alien_num in range(1000000):
'first': 'enrico', new_alien = {}
'username': 'efermi', for username, user_dict in users.items():
print("\nUsername: " + username) new_alien['color'] = 'green'
}, new_alien['points'] = 5
{ full_name = user_dict['first'] + " "
full_name += user_dict['last'] new_alien['x'] = 20 * alien_num
'last': 'curie', new_alien['y'] = 0
'first': 'marie', location = user_dict['location']
aliens.append(new_alien)
'username': 'mcurie',
}, print("\tFull name: " + full_name.title())
print("\tLocation: " + location.title()) # Prove the list contains a million aliens.
] num_aliens = len(aliens)
# Show all information about each user. print("Number of aliens created:")
for user_dict in users: Nesting is extremely useful in certain situations. However, print(num_aliens)
for k, v in user_dict.items(): be aware of making your code overly complex. If you're
print(k + ": " + v) nesting items much deeper than what you see here there
print("\n") are probably simpler ways of managing your data, such as More cheat sheets available at
using classes.
43
Testing numerical values is similar to testing string values. Several kinds of if statements exist. Your choice of which to
use depends on the number of conditions you need to test.
Testing equality and inequality You can have as many elif blocks as you need, and the
>>> age = 18 else block is always optional.
>>> age == 18 Simple if statement
True
>>> age != 18 age = 19
False
if age >= 18:
Comparison operators print("You're old enough to vote!")
>>> age = 19 If-else statements
>>> age < 21
True age = 17
>>> age <= 21
True if age >= 18:
If statements allow you to examine the current state >>> age > 21 print("You're old enough to vote!")
of a program and respond appropriately to that state. False else:
>>> age >= 21 print("You can't vote yet.")
You can write a simple if statement that checks one
False
condition, or you can create a complex series of if The if-elif-else chain
statements that idenitfy the exact conditions you're
age = 12
looking for.
You can check multiple conditions at the same time. The if age < 4:
While loops run as long as certain conditions remain and operator returns True if all the conditions listed are price = 0
true. You can use while loops to let your programs True. The or operator returns True if any condition is True. elif age < 18:
run as long as your users want them to. Using and to check multiple conditions price = 5
else:
>>> age_0 = 22 price = 10
>>> age_1 = 18
A conditional test is an expression that can be evaluated as >>> age_0 >= 21 and age_1 >= 21 print("Your cost is $" + str(price) + ".")
True or False. Python uses the values True and False to False
decide whether the code in an if statement should be >>> age_1 = 23
executed. >>> age_0 >= 21 and age_1 >= 21
True You can easily test whether a certain value is in a list. You
Checking for equality
A single equal sign assigns a value to a variable. A double equal can also test whether a list is empty before trying to loop
Using or to check multiple conditions through the list.
sign (==) checks whether two values are equal.
>>> age_0 = 22 Testing if a value is in a list
>>> car = 'bmw'
>>> age_1 = 18
>>> car == 'bmw' >>> players = ['al', 'bea', 'cyn', 'dale']
>>> age_0 >= 21 or age_1 >= 21
True >>> 'al' in players
True
>>> car = 'audi' True
>>> age_0 = 18
>>> car == 'bmw' >>> 'eric' in players
>>> age_0 >= 21 or age_1 >= 21
False False
False
Ignoring case when making a comparison
>>> car = 'Audi'
>>> car.lower() == 'audi' A boolean value is either True or False. Variables with
True boolean values are often used to keep track of certain
conditions within a program.
Checking for inequality Covers Python 3 and Python 2
Simple boolean values
>>> topping = 'mushrooms'
>>> topping != 'anchovies' game_active = True
True can_edit = False
44
Testing if a value is not in a list Letting the user choose when to quit Using continue in a loop
banned_users = ['ann', 'chad', 'dee'] prompt = "\nTell me something, and I'll " banned_users = ['eve', 'fred', 'gary', 'helen']
user = 'erin' prompt += "repeat it back to you."
prompt += "\nEnter 'quit' to end the program. " prompt = "\nAdd a player to your team."
if user not in banned_users: prompt += "\nEnter 'quit' when you're done. "
print("You can play!") message = ""
while message != 'quit': players = []
Checking if a list is empty message = input(prompt) while True:
players = [] player = input(prompt)
if message != 'quit': if player == 'quit':
if players: print(message) break
for player in players: elif player in banned_users:
Using a flag print(player + " is banned!")
print("Player: " + player.title())
else: prompt = "\nTell me something, and I'll " continue
print("We have no players yet!") prompt += "repeat it back to you." else:
prompt += "\nEnter 'quit' to end the program. " players.append(player)
Simple input
if message == 'quit':
name = input("What's your name? ") active = False Every while loop needs a way to stop running so it won't
print("Hello, " + name + ".") else: continue to run forever. If there's no way for the condition to
print(message) become False, the loop will never stop running.
Accepting numerical input
Using break to exit a loop An infinite loop
age = input("How old are you? ")
age = int(age) prompt = "\nWhat cities have you visited?" while True:
prompt += "\nEnter 'quit' when you're done. " name = input("\nWho are you? ")
if age >= 18: print("Nice to meet you, " + name + "!")
print("\nYou can vote!") while True:
else: city = input(prompt)
print("\nYou can't vote yet.")
if city == 'quit': The remove() method removes a specific value from a list,
Accepting input in Python 2.7 break
Use raw_input() in Python 2.7. This function interprets all input as a
but it only removes the first instance of the value you
string, just as input() does in Python 3.
else: provide. You can use a while loop to remove all instances
print("I've been to " + city + "!") of a particular value.
name = raw_input("What's your name? ")
print("Hello, " + name + ".") Removing all cats from a list of pets
pets = ['dog', 'cat', 'dog', 'fish', 'cat',
Sublime Text doesn't run programs that prompt the user for
'rabbit', 'cat']
input. You can use Sublime Text to write programs that
prompt for input, but you'll need to run these programs from print(pets)
A while loop repeats a block of code as long as a condition
is True. a terminal.
while 'cat' in pets:
Counting to 5 pets.remove('cat')
current_number = 1 print(pets)
You can use the break statement and the continue
statement with any of Python's loops. For example you can
while current_number <= 5: use break to quit a for loop that's working through a list or a
print(current_number) dictionary. You can use continue to skip over certain items More cheat sheets available at
current_number += 1 when looping through a list or dictionary as well.
45
The two main kinds of arguments are positional and A function can return a value or a set of values. When a
keyword arguments. When you use positional arguments function returns a value, the calling line must provide a
Python matches the first argument in the function call with variable in which to store the return value. A function stops
the first parameter in the function definition, and so forth. running when it reaches a return statement.
With keyword arguments, you specify which parameter
each argument should be assigned to in the function call. Returning a single value
When you use keyword arguments, the order of the def get_full_name(first, last):
arguments doesn't matter. """Return a neatly formatted full name."""
Using positional arguments full_name = first + ' ' + last
return full_name.title()
def describe_pet(animal, name):
Functions are named blocks of code designed to do """Display information about a pet.""" musician = get_full_name('jimi', 'hendrix')
print("\nI have a " + animal + ".") print(musician)
one specific job. Functions allow you to write code
print("Its name is " + name + ".")
once that can then be run whenever you need to Returning a dictionary
accomplish the same task. Functions can take in the describe_pet('hamster', 'harry') def build_person(first, last):
information they need, and return the information they describe_pet('dog', 'willie') """Return a dictionary of information
generate. Using functions effectively makes your about a person.
programs easier to write, read, test, and fix. Using keyword arguments
"""
def describe_pet(animal, name): person = {'first': first, 'last': last}
"""Display information about a pet.""" return person
The first line of a function is its definition, marked by the print("\nI have a " + animal + ".")
keyword def. The name of the function is followed by a set print("Its name is " + name + ".") musician = build_person('jimi', 'hendrix')
of parentheses and a colon. A docstring, in triple quotes, print(musician)
describes what the function does. The body of a function is describe_pet(animal='hamster', name='harry')
describe_pet(name='willie', animal='dog') Returning a dictionary with optional values
indented one level.
To call a function, give the name of the function followed def build_person(first, last, age=None):
by a set of parentheses. """Return a dictionary of information
about a person.
Making a function You can provide a default value for a parameter. When
"""
function calls omit this argument the default value will be
def greet_user(): used. Parameters with default values must be listed after person = {'first': first, 'last': last}
"""Display a simple greeting.""" parameters without default values in the function's definition if age:
print("Hello!") so positional arguments can still work correctly. person['age'] = age
return person
greet_user() Using a default value
musician = build_person('jimi', 'hendrix', 27)
def describe_pet(name, animal='dog'):
print(musician)
"""Display information about a pet."""
Information that's passed to a function is called an print("\nI have a " + animal + ".")
musician = build_person('janis', 'joplin')
argument; information that's received by a function is called print("Its name is " + name + ".")
print(musician)
a parameter. Arguments are included in parentheses after
the function's name, and parameters are listed in describe_pet('harry', 'hamster')
parentheses in the function's definition. describe_pet('willie')
Try running some of these examples on pythontutor.com.
Passing a single argument Using None to make an argument optional
def greet_user(username): def describe_pet(animal, name=None):
"""Display a simple greeting.""" """Display information about a pet."""
print("Hello, " + username + "!") print("\nI have a " + animal + ".")
if name:
print("Its name is " + name + ".")
Covers Python 3 and Python 2
greet_user('jesse')
greet_user('diana')
greet_user('brandon') describe_pet('hamster', 'harry')
describe_pet('snake')
46
You can pass a list as an argument to a function, and the Sometimes you won't know how many arguments a You can store your functions in a separate file called a
function can work with the values in the list. Any changes function will need to accept. Python allows you to collect an module, and then import the functions you need into the file
the function makes to the list will affect the original list. You arbitrary number of arguments into one parameter using the containing your main program. This allows for cleaner
can prevent a function from modifying a list by passing a * operator. A parameter that accepts an arbitrary number of program files. (Make sure your module is stored in the
copy of the list as an argument. arguments must come last in the function definition. same directory as your main program.)
The ** operator allows a parameter to collect an arbitrary
Passing a list as an argument number of keyword arguments. Storing a function in a module
File: pizza.py
def greet_users(names): Collecting an arbitrary number of arguments
"""Print a simple greeting to everyone.""" def make_pizza(size, *toppings):
for name in names: def make_pizza(size, *toppings): """Make a pizza."""
msg = "Hello, " + name + "!" """Make a pizza.""" print("\nMaking a " + size + " pizza.")
print(msg) print("\nMaking a " + size + " pizza.") print("Toppings:")
print("Toppings:") for topping in toppings:
usernames = ['hannah', 'ty', 'margot'] for topping in toppings: print("- " + topping)
greet_users(usernames) print("- " + topping)
Importing an entire module
Allowing a function to modify a list File: making_pizzas.py
# Make three pizzas with different toppings. Every function in the module is available in the program file.
The following example sends a list of models to a function for
make_pizza('small', 'pepperoni')
printing. The original list is emptied, and the second list is filled. import pizza
make_pizza('large', 'bacon bits', 'pineapple')
def print_models(unprinted, printed): make_pizza('medium', 'mushrooms', 'peppers',
"""3d print a set of models.""" 'onions', 'extra cheese') pizza.make_pizza('medium', 'pepperoni')
while unprinted: pizza.make_pizza('small', 'bacon', 'pineapple')
current_model = unprinted.pop() Collecting an arbitrary number of keyword arguments
Importing a specific function
print("Printing " + current_model) def build_profile(first, last, **user_info): Only the imported functions are available in the program file.
printed.append(current_model) """Build a user's profile dictionary."""
# Build a dict with the required keys. from pizza import make_pizza
# Store some unprinted designs, profile = {'first': first, 'last': last}
# and print each of them. make_pizza('medium', 'pepperoni')
unprinted = ['phone case', 'pendant', 'ring'] # Add any other keys and values. make_pizza('small', 'bacon', 'pineapple')
printed = [] for key, value in user_info.items():
print_models(unprinted, printed) Giving a module an alias
profile[key] = value
import pizza as p
print("\nUnprinted:", unprinted) return profile
print("Printed:", printed) p.make_pizza('medium', 'pepperoni')
# Create two users with different kinds p.make_pizza('small', 'bacon', 'pineapple')
Preventing a function from modifying a list
The following example is the same as the previous one, except the # of information.
user_0 = build_profile('albert', 'einstein',
Giving a function an alias
original list is unchanged after calling print_models().
location='princeton') from pizza import make_pizza as mp
def print_models(unprinted, printed): user_1 = build_profile('marie', 'curie',
"""3d print a set of models.""" location='paris', field='chemistry') mp('medium', 'pepperoni')
while unprinted: mp('small', 'bacon', 'pineapple')
current_model = unprinted.pop() print(user_0)
print("Printing " + current_model) print(user_1) Importing all functions from a module
printed.append(current_model) Don't do this, but recognize it when you see it in others' code. It
can result in naming conflicts, which can cause errors.
# Store some unprinted designs, from pizza import *
# and print each of them. As you can see there are many ways to write and call a
original = ['phone case', 'pendant', 'ring'] function. When you're starting out, aim for something that make_pizza('medium', 'pepperoni')
printed = [] simply works. As you gain experience you'll develop an make_pizza('small', 'bacon', 'pineapple')
understanding of the more subtle advantages of different
print_models(original[:], printed) structures such as positional and keyword arguments, and
print("\nOriginal:", original) the various approaches to importing functions. For now if More cheat sheets available at
print("Printed:", printed) your functions do what you need them to, you're doing well.
47
If the class you're writing is a specialized version of another
Creating an object from a class class, you can use inheritance. When one class inherits
my_car = Car('audi', 'a4', 2016) from another, it automatically takes on all the attributes and
methods of the parent class. The child class is free to
Accessing attribute values introduce new attributes and methods, and override
attributes and methods of the parent class.
print(my_car.make)
To inherit from another class include the name of the
print(my_car.model)
parent class in parentheses when defining the new class.
print(my_car.year)
Classes are the foundation of object-oriented The __init__() method for a child class
Calling methods
programming. Classes represent real-world things
class ElectricCar(Car):
you want to model in your programs: for example my_car.fill_tank()
"""A simple model of an electric car."""
dogs, cars, and robots. You use a class to make my_car.drive()
objects, which are specific instances of dogs, cars, Creating multiple objects def __init__(self, make, model, year):
and robots. A class defines the general behavior that """Initialize an electric car."""
a whole category of objects can have, and the my_car = Car('audi', 'a4', 2016) super().__init__(make, model, year)
information that can be associated with those objects. my_old_car = Car('subaru', 'outback', 2013)
my_truck = Car('toyota', 'tacoma', 2010) # Attributes specific to electric cars.
Classes can inherit from each other – you can
write a class that extends the functionality of an # Battery capacity in kWh.
existing class. This allows you to code efficiently for a self.battery_size = 70
# Charge level in %.
wide variety of situations. You can modify an attribute's value directly, or you can
self.charge_level = 0
write methods that manage updating values more carefully.
Modifying an attribute directly Adding new methods to the child class
Consider how we might model a car. What information class ElectricCar(Car):
my_new_car = Car('audi', 'a4', 2016)
would we associate with a car, and what behavior would it --snip--
my_new_car.fuel_level = 5
have? The information is stored in variables called def charge(self):
attributes, and the behavior is represented by functions. Writing a method to update an attribute's value """Fully charge the vehicle."""
Functions that are part of a class are called methods. self.charge_level = 100
def update_fuel_level(self, new_level): print("The vehicle is fully charged.")
The Car class """Update the fuel level."""
if new_level <= self.fuel_capacity: Using child methods and parent methods
class Car():
self.fuel_level = new_level
"""A simple attempt to model a car.""" my_ecar = ElectricCar('tesla', 'model s', 2016)
else:
print("The tank can't hold that much!")
def __init__(self, make, model, year): my_ecar.charge()
"""Initialize car attributes.""" Writing a method to increment an attribute's value my_ecar.drive()
self.make = make
self.model = model def add_fuel(self, amount):
self.year = year """Add fuel to the tank."""
if (self.fuel_level + amount
# Fuel capacity and level in gallons. <= self.fuel_capacity): There are many ways to model real world objects and
self.fuel_capacity = 15 self.fuel_level += amount situations in code, and sometimes that variety can feel
self.fuel_level = 0 print("Added fuel.") overwhelming. Pick an approach and try it – if your first
else: attempt doesn't work, try a different approach.
def fill_tank(self): print("The tank won't hold that much.")
"""Fill gas tank to capacity."""
self.fuel_level = self.fuel_capacity
print("Fuel tank is full.")
In Python class names are written in CamelCase and object Covers Python 3 and Python 2
def drive(self): names are written in lowercase with underscores. Modules
"""Simulate driving.""" that contain classes should still be named in lowercase with
print("The car is moving.") underscores.
48
Class files can get long as you add detailed information and
Overriding parent methods functionality. To help keep your program files uncluttered, Classes should inherit from object
class ElectricCar(Car): you can store your classes in modules and import the class ClassName(object):
--snip-- classes you need into your main program.
def fill_tank(self): The Car class in Python 2.7
Storing classes in a file
"""Display an error message.""" car.py class Car(object):
print("This car has no fuel tank!")
"""Represent gas and electric cars.""" Child class __init__() method is different
class ChildClassName(ParentClass):
class Car():
def __init__(self):
A class can have objects as attributes. This allows classes """A simple attempt to model a car."""
super(ClassName, self).__init__()
to work together to model complex situations. --snip—
The ElectricCar class in Python 2.7
A Battery class class Battery():
"""A battery for an electric car.""" class ElectricCar(Car):
class Battery(): def __init__(self, make, model, year):
"""A battery for an electric car.""" --snip--
super(ElectricCar, self).__init__(
class ElectricCar(Car): make, model, year)
def __init__(self, size=70):
"""Initialize battery attributes.""" """A simple model of an electric car."""
# Capacity in kWh, charge level in %. --snip--
self.size = size Importing individual classes from a module A list can hold as many items as you want, so you can
self.charge_level = 0 my_cars.py make a large number of objects from a class and store
them in a list.
def get_range(self): from car import Car, ElectricCar Here's an example showing how to make a fleet of rental
"""Return the battery's range.""" cars, and make sure all the cars are ready to drive.
if self.size == 70: my_beetle = Car('volkswagen', 'beetle', 2016)
my_beetle.fill_tank() A fleet of rental cars
return 240
elif self.size == 85: my_beetle.drive() from car import Car, ElectricCar
return 270
my_tesla = ElectricCar('tesla', 'model s', # Make lists to hold a fleet of cars.
Using an instance as an attribute 2016) gas_fleet = []
my_tesla.charge() electric_fleet = []
class ElectricCar(Car):
my_tesla.drive()
--snip--
Importing an entire module # Make 500 gas cars and 250 electric cars.
def __init__(self, make, model, year): for _ in range(500):
"""Initialize an electric car.""" import car car = Car('ford', 'focus', 2016)
super().__init__(make, model, year) gas_fleet.append(car)
my_beetle = car.Car( for _ in range(250):
# Attribute specific to electric cars. 'volkswagen', 'beetle', 2016) ecar = ElectricCar('nissan', 'leaf', 2016)
self.battery = Battery() my_beetle.fill_tank() electric_fleet.append(ecar)
my_beetle.drive()
def charge(self): # Fill the gas cars, and charge electric cars.
"""Fully charge the vehicle.""" my_tesla = car.ElectricCar( for car in gas_fleet:
self.battery.charge_level = 100 'tesla', 'model s', 2016) car.fill_tank()
print("The vehicle is fully charged.") my_tesla.charge() for ecar in electric_fleet:
my_tesla.drive() ecar.charge()
Using the instance
Importing all classes from a module
my_ecar = ElectricCar('tesla', 'model x', 2016) (Don’t do this, but recognize it when you see it.) print("Gas cars:", len(gas_fleet))
print("Electric cars:", len(electric_fleet))
from car import *
my_ecar.charge()
print(my_ecar.battery.get_range()) More cheat sheets available at
my_beetle = Car('volkswagen', 'beetle', 2016)
my_ecar.drive()
49
Storing the lines in a list Opening a file using an absolute path
filename = 'siddhartha.txt' f_path = "/home/ehmatthes/books/alice.txt"
import unittest E
from full_names import get_full_name ================================================
ERROR: test_first_last (__main__.NamesTestCase)
class NamesTestCase(unittest.TestCase): Test names like Janis Joplin.
"""Tests for names.py.""" ------------------------------------------------
Traceback (most recent call last):
def test_first_last(self): File "test_full_names.py", line 10,
When you write a function or a class, you can also """Test names like Janis Joplin.""" in test_first_last
write tests for that code. Testing proves that your full_name = get_full_name('janis', 'joplin')
code works as it's supposed to in the situations it's 'joplin') TypeError: get_full_name() missing 1 required
designed to handle, and also when people use your self.assertEqual(full_name, positional argument: 'last'
programs in unexpected ways. Writing tests gives 'Janis Joplin')
you confidence that your code will work correctly as ------------------------------------------------
more people begin to use your programs. You can unittest.main() Ran 1 test in 0.001s
also add new features to your programs and know Running the test
that you haven't broken existing behavior. FAILED (errors=1)
Python reports on each unit test in the test case. The dot reports a
single passing test. Python informs us that it ran 1 test in less than Fixing the code
A unit test verifies that one specific aspect of your 0.001 seconds, and the OK lets us know that all unit tests in the When a test fails, the code needs to be modified until the test
test case passed. passes again. (Don’t make the mistake of rewriting your tests to fit
code works as it's supposed to. A test case is a
your new code.) Here we can make the middle name optional.
collection of unit tests which verify your code's .
behavior in a wide variety of situations. --------------------------------------- def get_full_name(first, last, middle=''):
Ran 1 test in 0.000s """Return a full name."""
if middle:
OK full_name = "{0} {1} {2}".format(first,
Python's unittest module provides tools for testing your middle, last)
code. To try it out, we’ll create a function that returns a full else:
name. We’ll use the function in a regular program, and then full_name = "{0} {1}".format(first,
build a test case for the function. Failing tests are important; they tell you that a change in the
last)
code has affected existing behavior. When a test fails, you
A function to test return full_name.title()
need to modify the code so the existing behavior still works.
Save this as full_names.py
Modifying the function Running the test
def get_full_name(first, last): Now the test should pass again, which means our original
We’ll modify get_full_name() so it handles middle names, but
"""Return a full name.""" functionality is still intact.
we’ll do it in a way that breaks existing behavior.
full_name = "{0} {1}".format(first, last) .
return full_name.title() def get_full_name(first, middle, last):
---------------------------------------
"""Return a full name."""
Ran 1 test in 0.000s
Using the function full_name = "{0} {1} {2}".format(first,
Save this as names.py middle, last)
OK
return full_name.title()
from full_names import get_full_name
Using the function
janis = get_full_name('janis', 'joplin')
print(janis) from full_names import get_full_name
x_values = [0, 1, 2, 3, 4, 5]
squares = [0, 1, 4, 9, 16, 25]
Covers Python 3 and Python 2
plt.plot(x_values, squares)
plt.show()
56
You can make as many plots as you want on one figure. You can include as many individual graphs in one figure as
When you make multiple plots, you can emphasize Datetime formatting arguments you want. This is useful, for example, when comparing
The strftime() function generates a formatted string from a
relationships in the data. For example you can fill the space related datasets.
datetime object, and the strptime() function genereates a
between two sets of data. datetime object from a string. The following codes let you work with Sharing an x-axis
Plotting two sets of data dates exactly as you need to. The following code plots a set of squares and a set of cubes on
Here we use plt.scatter() twice to plot square numbers and %A Weekday name, such as Monday two separate graphs that share a common x-axis.
cubes on the same figure. The plt.subplots() function returns a figure object and a tuple
%B Month name, such as January of axes. Each set of axes corresponds to a separate plot in the
import matplotlib.pyplot as plt %m Month, as a number (01 to 12) figure. The first two arguments control the number of rows and
%d Day of the month, as a number (01 to 31) columns generated in the figure.
x_values = list(range(11)) %Y Four-digit year, such as 2016
%y Two-digit year, such as 16 import matplotlib.pyplot as plt
squares = [x**2 for x in x_values]
cubes = [x**3 for x in x_values] %H Hour, in 24-hour format (00 to 23)
%I Hour, in 12-hour format (01 to 12) x_vals = list(range(11))
%p AM or PM squares = [x**2 for x in x_vals]
plt.scatter(x_values, squares, c='blue',
%M Minutes (00 to 59) cubes = [x**3 for x in x_vals]
edgecolor='none', s=20)
plt.scatter(x_values, cubes, c='red', %S Seconds (00 to 61)
fig, axarr = plt.subplots(2, 1, sharex=True)
edgecolor='none', s=20)
Converting a string to a datetime object
axarr[0].scatter(x_vals, squares)
plt.axis([0, 11, 0, 1100]) new_years = dt.strptime('1/1/2017', '%m/%d/%Y')
axarr[0].set_title('Squares')
plt.show()
Converting a datetime object to a string
Filling the space between data sets axarr[1].scatter(x_vals, cubes, c='red')
The fill_between() method fills the space between two data ny_string = dt.strftime(new_years, '%B %d, %Y') axarr[1].set_title('Cubes')
sets. It takes a series of x-values and two series of y-values. It also print(ny_string)
takes a facecolor to use for the fill, and an optional alpha plt.show()
argument that controls the color’s transparency. Plotting high temperatures
The following code creates a list of dates and a corresponding list
Sharing a y-axis
plt.fill_between(x_values, cubes, squares, of high temperatures. It then plots the high temperatures, with the
To share a y-axis, we use the sharey=True argument.
facecolor='blue', alpha=0.25) date labels displayed in a specific format.
from datetime import datetime as dt import matplotlib.pyplot as plt
outcomes = [1, 2, 3, 4, 5, 6]
To make a plot with Pygal, you specify the kind of plot and frequencies = [18, 16, 18, 17, 18, 13]
then add the data.
chart = pygal.Bar()
Making a line graph
To view the output, open the file squares.svg in a browser.
chart.force_uri_protocol = 'http'
chart.x_labels = outcomes
import pygal chart.add('D6', frequencies)
chart.render_to_file('rolling_dice.svg')
x_values = [0, 1, 2, 3, 4, 5]
squares = [0, 1, 4, 9, 16, 25] Making a bar graph from a dictionary
Since each bar needs a label and a value, a dictionary is a great The documentation for Pygal is available at
way to store the data for a bar graph. The keys are used as the http://www.pygal.org/.
chart = pygal.Line() labels along the x-axis, and the values are used to determine the
chart.force_uri_protocol = 'http' height of each bar.
chart.add('x^2', squares)
chart.render_to_file('squares.svg') import pygal If you’re viewing svg output in a browser, Pygal needs to
render the output file in a specific way. The
Adding labels and a title results = { force_uri_protocol attribute for chart objects needs to
--snip-- 1:18, 2:16, 3:18, be set to 'http'.
chart = pygal.Line() 4:17, 5:18, 6:13,
chart.force_uri_protocol = 'http' }
chart.title = "Squares"
chart.x_labels = x_values chart = pygal.Bar()
chart.x_title = "Value" chart.force_uri_protocol = 'http' Covers Python 3 and Python 2
chart.y_title = "Square of Value" chart.x_labels = results.keys()
chart.add('x^2', squares) chart.add('D6', results.values())
chart.render_to_file('squares.svg') chart.render_to_file('rolling_dice.svg')
58
Pygal lets you customize many elements of a plot. There Pygal can generate world maps, and you can add any data
are some excellent default themes, and many options for Configuration settings you want to these maps. Data is indicated by coloring, by
Some settings are controlled by a Config object.
styling individual plot elements. labels, and by tooltips that show data when users hover
my_config = pygal.Config() over each country on the map.
Using built-in styles
To use built-in styles, import the style and make an instance of the
my_config.show_y_guides = False
Installing the world map module
style class. Then pass the style object with the style argument my_config.width = 1000 The world map module is not included by default in Pygal 2.0. It
when you make the chart object. my_config.dots_size = 5 can be installed with pip:
import pygal chart = pygal.Line(config=my_config) $ pip install --user pygal_maps_world
from pygal.style import LightGreenStyle --snip-- Making a world map
x_values = list(range(11)) Styling series The following code makes a simple world map showing the
You can give each series on a chart different style settings. countries of North America.
squares = [x**2 for x in x_values]
cubes = [x**3 for x in x_values] chart.add('Squares', squares, dots_size=2) from pygal.maps.world import World
chart.add('Cubes', cubes, dots_size=3)
chart_style = LightGreenStyle() wm = World()
chart = pygal.Line(style=chart_style) Styling individual data points wm.force_uri_protocol = 'http'
chart.force_uri_protocol = 'http' You can style individual data points as well. To do so, write a wm.title = 'North America'
chart.title = "Squares and Cubes" dictionary for each data point you want to customize. A 'value' wm.add('North America', ['ca', 'mx', 'us'])
chart.x_labels = x_values key is required, and other properies are optional.
import pygal wm.render_to_file('north_america.svg')
chart.add('Squares', squares)
chart.add('Cubes', cubes) Showing all the country codes
repos = [ In order to make maps, you need to know Pygal’s country codes.
chart.render_to_file('squares_cubes.svg') { The following example will print an alphabetical list of each country
'value': 20506, and its code.
Parametric built-in styles
Some built-in styles accept a custom color, then generate a theme 'color': '#3333CC',
'xlink': 'http://djangoproject.com/', from pygal.maps.world import COUNTRIES
based on that color.
},
from pygal.style import LightenStyle 20054, for code in sorted(COUNTRIES.keys()):
12607, print(code, COUNTRIES[code])
--snip-- 11827,
chart_style = LightenStyle('#336688')
Plotting numerical data on a world map
] To plot numerical data on a map, pass a dictionary to add()
chart = pygal.Line(style=chart_style) instead of a list.
--snip-- chart = pygal.Bar()
chart.force_uri_protocol = 'http' from pygal.maps.world import World
Customizing individual style properties
Style objects have a number of properties you can set individually. chart.x_labels = [
populations = {
'django', 'requests', 'scikit-learn',
chart_style = LightenStyle('#336688') 'ca': 34126000,
'tornado',
chart_style.plot_background = '#CCCCCC' ] 'us': 309349000,
chart_style.major_label_font_size = 20 chart.y_title = 'Stars' 'mx': 113423000,
chart_style.label_font_size = 16 chart.add('Python Repos', repos) }
--snip-- chart.render_to_file('python_repos.svg')
wm = World()
Custom style class wm.force_uri_protocol = 'http'
You can start with a bare style class, and then set only the wm.title = 'Population of North America'
properties you care about. wm.add('North America', populations)
chart_style = Style()
chart_style.colors = [ wm.render_to_file('na_populations.svg')
'#CCCCCC', '#AAAAAA', '#888888']
chart_style.plot_background = '#EEEEEE'
(works with about every distribution, except for apt-get which is Ubuntu/Debian exclusive)
Legend:
find = the best file search tool (fast) With regular expressions:
find -name “<fileName>”
find -name “text” = search for files who start with the word text grep -E ^<text> <fileName> = search start of lines
find -name “*text” = “ “ “ “ end “ “ “ “ with the word text
grep -E <0-4> <fileName> =shows lines containing numbers 0-4
Advanced Search: grep -E <a-zA-Z> <fileName> = retrieve all lines
with alphabetical letters
Search from file Size (in ~)
find ~ -size +10M = search files bigger than.. (M,K,G) sort = sort the content of files
sort <fileName> = sort alphabetically
Search from last access sort -o <file> <outputFile> = write result to a file
find -name “<filetype>” -atime -5 sort -r <fileName> = sort in reverse
('-' = less than, '+' = more than and nothing = exactly) sort -R <fileName> = sort randomly
sort -n <fileName> = sort numbers
Search only files or directory’s
find -type d --> ex: find /var/log -name "syslog" -type d wc = word count
find -type f = files wc <fileName> = nbr of line, nbr of words, byte size
-l (lines), -w (words), -c (byte size), -m
More info: man find, man locate (number of characters)
sleep = pause between commands jobs = know what is running in the background
with ';' you can chain commands, ex: touch file; rm file
you can make a pause between commands (minutes, hours, days) fg = put a background process to foreground
ex --> touch file; sleep 10; rm file <-- 10 seconds ex: fg (process 1), f%2 (process 2) f%3, ...
67
Linux Bash Shell Cheat Sheet
Basic Commands
git init Create empty Git repo in specified directory. Run with no arguments git commit Replace the last commit with the staged changes and last commit
<directory> to initialize the current directory as a git repository. --amend combined. Use with nothing staged to edit the last commit’s message.
git clone <repo> Clone repo located at <repo> onto local machine. Original repo can be git rebase <base> Rebase the current branch onto <base>. <base> can be a commit ID,
located on the local filesystem or on a remote machine via HTTP or SSH. a branch name, a tag, or a relative reference to HEAD.
git config Define author name to be used for all commits in current repo. Devs git reflog Show a log of changes to the local repository’s HEAD. Add
user.name <name> commonly use --global flag to set config options for current user. --relative-date flag to show date info or --all to show all refs.
git add Stage all changes in <directory> for the next commit.
<directory> Replace <directory> with a <file> to change a specific file. Git Branches
git commit -m Commit the staged snapshot, but instead of launching a text editor, git branch List all of the branches in your repo. Add a <branch> argument to
"<message>" use <message> as the commit message. create a new branch with the name <branch>.
git status List which files are staged, unstaged, and untracked. git checkout -b Create and check out a new branch named <branch>. Drop the -b
<branch> flag to checkout an existing branch.
git log Display the entire commit history using the default format. git merge <branch> Merge <branch> into the current branch.
For customization see additional options.
git diff Show unstaged changes between your index and working directory.
Remote Repositories
git remote add Create a new connection to a remote repo. After adding a remote,
Undoing Changes <name> <url> you can use <name> as a shortcut for <url> in other commands.
git revert Create new commit that undoes all of the changes made in git fetch Fetches a specific <branch>, from the repo. Leave off <branch> to
<commit> <commit>, then apply it to the current branch. <remote> <branch> fetch all remote refs.
git reset <file> Remove <file> from the staging area, but leave the working directory git pull <remote> Fetch the specified remote’s copy of current branch and immediately
unchanged. This unstages a file without overwriting any changes. merge it into the local copy.
git clean -n Shows which files would be removed from working directory. Use git push Push the branch to <remote>, along with necessary commits and
the -f flag in place of the -n flag to execute the clean. <remote> <branch> objects. Creates named branch in the remote repo if it doesn’t exist.
73
git config --global Define the author name to be used for all commits by the current user. git diff HEAD Show difference between working directory and last commit.
user.name <name> git diff --cached Show difference between staged changes and last commit
git config --global Define the author email to be used for all commits by the current user.
user.email <email>
git reset
git config --global
Create shortcut for a Git command. E.g. alias.glog log --graph git reset Reset staging area to match most recent commit, but leave the
alias. <alias-name>
--oneline will set git glog equivalent to git log --graph --oneline. working directory unchanged.
<git-command>
git config --system Set text editor used by commands for all users on the machine. <editor> git reset --hard Reset staging area and working directory to match most recent
core.editor <editor> arg should be the command that launches the desired editor (e.g., vi). commit and overwrites all changes in the working directory.
git config Open the global configuration file in a text editor for manual editing. git reset <commit> Move the current branch tip backward to <commit>, reset the
--global --edit staging area to match, but leave the working directory alone.
git reset --hard Same as previous, but resets both the staging area & working directory to
git log <commit> match. Deletes uncommitted changes, and all commits after <commit>.
git log -<limit> Limit number of commits by <limit>. E.g. git log -5 will limit to 5
commits. git rebase
git log --oneline Condense each commit to a single line. git rebase -i Interactively rebase current branch onto <base>. Launches editor to enter
git log -p Display the full diff of each commit. <base> commands for how each commit will be transferred to the new base.
git log --stat Include which files were altered and the relative number of lines
that were added or deleted from each of them. git pull
git log --author= Search for commits by a particular author. git pull --rebase Fetch the remote’s copy of current branch and rebases it into the local
”<pattern>” <remote> copy. Uses git rebase instead of merge to integrate the branches.
git log Search for commits with a commit message that matches <pattern>.
--grep=”<pattern>”
git push
git log Show commits that occur between <since> and <until>. Args can be a git push <remote> Forces the git push even if it results in a non-fast-forward merge. Do not use
<since>..<until> commit ID, branch name, HEAD, or any other kind of revision reference. --force the --force flag unless you’re absolutely sure you know what you’re doing.
git log -- <file> Only display commits that have the specified file. git push <remote> Push all of your local branches to the specified remote.
--all
git log --graph --graph flag draws a text based graph of commits on left side of commit git push <remote> Tags aren’t automatically pushed when you push a branch or use the
--decorate msgs. --decorate adds names of branches or tags of commits shown. --tags --all flag. The --tags flag sends all of your local tags to the remote repo.
74
CREATE TEMPORARY VIEW v MIN returns the minimum value in a list CREATE TRIGGER before_insert_person
AS BEFORE INSERT
SELECT c1, c2 ON person FOR EACH ROW
FROM t; EXECUTE stored_procedure;
Create a temporary view Create a trigger invoked before a new row is
inserted into the person table
n
VIP Refresher: Probabilities and Statistics Remark: for any event B in the sample space, we have P (B) = P (B|Ai )P (Ai ).
i=1
ÿ
Afshine Amidi and Shervine Amidi r Extended form of Bayes’ rule – Let {Ai , i œ [[1,n]]} be a partition of the sample space.
We have:
P (B|Ak )P (Ak )
August 6, 2018 P (Ak |B) = n
P (B|Ai )P (Ai )
i=1
ÿ
i=1 i=1
defined as:
€ ÿ
F (x) = P (X 6 x)
r Permutation – A permutation is an arrangement of r objects from a pool of n objects, in a
given order. The number of such arrangements is given by P (n, r), defined as: Remark: we have P (a < X 6 B) = F (b) ≠ F (a).
n!
P (n, r) = r Probability density function (PDF) – The probability density function f is the probability
(n ≠ r)! that X takes on values between two adjacent realizations of the random variable.
r Relationships involving the PDF and CDF – Here are the important properties to know
r Combination – A combination is an arrangement of r objects from a pool of n objects, where in the discrete (D) and the continuous (C) cases.
the order does not matter. The number of such arrangements is given by C(n, r), defined as:
P (n, r) n! Case CDF F PDF f Properties of PDF
C(n, r) = =
r! r!(n ≠ r)!
(D) F (x) = P (X = xi ) f (xj ) = P (X = xj ) 0 6 f (xj ) 6 1 and f (xj ) = 1
Remark: we note that for 0 6 r 6 n, we have P (n,r) > C(n,r).
xi 6x j
ÿ ÿ
x dF +Œ
Conditional Probability (C) F (x) = f (y)dy f (x) = f (x) > 0 and f (x)dx = 1
ˆ ˆ
≠Œ dx ≠Œ
r Bayes’ rule – For events A and B such that P (B) > 0, we have:
P (B|A)P (A)
P (A|B) = r Variance – The variance of a random variable, often noted Var(X) or ‡ 2 , is a measure of the
P (B) spread of its distribution function. It is determined as follows:
Remark: we have P (A fl B) = P (A)P (B|A) = P (A|B)P (B). Var(X) = E[(X ≠ E[X])2 ] = E[X 2 ] ≠ E[X]2
r Partition – Let {Ai , i œ [[1,n]]} be such that for all i, Ai ”= ?. We say that {Ai } is a partition
if we have: r Standard deviation – The standard deviation of a random variable, often noted ‡, is a
n
measure of the spread of its distribution function which is compatible with the units of the
actual random variable. It is determined as follows:
’i ”= j, Ai fl Aj = ÿ and Ai = S
i=1 ‡= Var(X)
€
r Expectation and Moments of the Distribution – Here are the expressions of the expected r Marginal density and cumulative distribution – From the joint density probability
value E[X], generalized expected value E[g(X)], kth moment E[X k ] and characteristic function function fXY , we have:
Â(Ê) for the discrete and continuous cases:
Case Marginal density Cumulative function
Case E[X] E[g(X)] E[X k ] Â(Ê) (D) fX (xi ) = fXY (xi ,yj ) FXY (x,y) = fXY (xi ,yj )
n n n n j xi 6x yj 6y
ÿ ÿÿ
≠Œ ≠Œ ≠Œ
+Œ +Œ +Œ +Œ
(C) xf (x)dx g(x)f (x)dx xk f (x)dx f (x)eiÊx dx
ˆ ˆ ˆ ˆ
≠Œ ≠Œ ≠Œ ≠Œ
r Distribution of a sum of independent random variables – Let Y = X1 + ... + Xn with
X1 , ..., Xn independent. We have:
Remark: we have eiÊx = cos(Êx) + i sin(Êx). n
1 ˆk  2
E[X k ] = r Covariance – We define the covariance of two random variables X and Y , that we note ‡XY
ik ˆÊ k or more commonly Cov(X,Y ), as follows:
Ê=0
5 6
2
Cov(X,Y ) , ‡XY = E[(X ≠ µX )(Y ≠ µY )] = E[XY ] ≠ µX µY
r Transformation of random variables – Let the variables X and Y be linked by some
function. By noting fX and fY the distribution function of X and Y respectively, we have:
r Correlation – By noting ‡X , ‡Y the standard deviations of X and Y , we define the correlation
between the random variables X and Y , noted flXY , as follows:
dy 2
‡XY
flXY =
- -
‡ X ‡Y
fY (y) = fX (x) -
- dx -
-
r Leibniz integral rule – Let g be a function of x and potentially c, and a, b boundaries that
may depend on c. We have: Remarks: For any X, Y , we have flXY œ [≠1,1]. If X and Y are independent, then flXY = 0.
r Chebyshev’s inequality – Let X be a random variable with expected value µ and standard P (X = x) = px q n≠x (peiÊ + q)n np npq
deviation ‡. For k, ‡ > 0, we have the following inequality: x
X ≥ B(n, p)
Binomial
1n2
1
x œ [[0,n]]
P (|X ≠ µ| > k‡) 6 (D)
k2 iÊ
P (X = x) = e eµ(e µ µ
µx ≠µ ≠1)
X ≥ Po(µ)
x!
Poisson xœN
Jointly Distributed Random Variables
1 eiÊb ≠ eiÊa a+b (b ≠ a)2
X ≥ U (a, b) f (x) =
2 12
r Conditional density – The conditional density of X with respect to Y , often noted fX|Y ,
b≠a (b ≠ a)iÊ
Uniform x œ [a,b]
is defined as follows:
1 1 2
fXY (x,y) 2
≠1 ‡ ‡2
fX|Y (x) = (C) X ≥ N (µ, ‡) f (x) = Ô e eiʵ≠ 2 Ê µ ‡2
fY (y) 2fi‡
Gaussian
! x≠µ "2
xœR
1 1 1
r Independence – Two random variables X and Y are said to be independent if we have: X ≥ Exp(⁄) f (x) = ⁄e≠⁄x iÊ
1≠ ⁄
⁄ ⁄2
fXY (x,y) = fX (x)fY (y) Exponential x œ R+
Parameter estimation
r Random sample – A random sample is a collection of n random variables X1 , ..., Xn that
are independent and identically distributed with X.
r Estimator – An estimator ◊ˆ is a function of the data that is used to infer the value of an
unknown parameter ◊ in a statistical model.
r Bias – The bias of an estimator ◊ˆ is defined as being the difference between the expected
value of the distribution of ◊ˆ and the true value, i.e.:
Bias(◊)
ˆ = E[◊]
ˆ ≠◊
r Central Limit Theorem – Let us have a random sample X1 , ..., Xn following a given
distribution with mean µ and variance ‡ 2 , then we have:
‡
X
næ+Œ
≥ N µ, Ô
n
1 2
k
With Replacement n
0.4
●
k
⇣ n + k 1⌘
n! ● ●
0.2
Without Replacement
(n k)! k
● ●
⇣ n⌘
band-aid
0.0
●
1 if A occurs,
1.0
● ●
IA =
0 if A does not occur.
(
1
E(X) = xf (x)dx (for continuous X)
0.8
2
Z
1
● ● Note that IA = IA , IA IB = IA\B , and IA[B = IA + IB I A IB .
The Law of the Unconscious Statistician (LOTUS) states that
0.6
you can find the expected value of a function of a random variable,
cdf
Distribution IA ⇠ Bern(p) where p = P (A).
Fundamental Bridge The expectation of the indicator for event A is g(X), in a similar way, by replacing the x in front of the PMF/PDF by
0.4
● ● the probability of event A: E(IA ) = P (A). g(x) but still working with the PMF/PDF of X:
0.2
E(g(X)) = g(x)P (X = x) (for discrete X)
● ● Variance and Standard Deviation
● x
0.0
X
2 2 2
Var(X) = E (X E(X)) = E(X ) (E(X))
0 1 2 3 4 1
E(g(X)) = g(x)f (x)dx (for continuous X)
x SD(X) = Var(X)
Z
1
q
The CDF is an increasing, right-continuous function with What’s a function of a random variable? A function of a random
Continuous RVs, LOTUS, UoU variable is also a random variable. For example, if X is the number of
bikes you see in an hour, then g(X) = 2X is the number of bike wheels
FX (x) ! 0 as x ! 1 and FX (x) ! 1 as x ! 1
X(X 1)
Independence Intuitively, two random variables are independent if Continuous Random Variables (CRVs) you see in that hour and h(X) = X 2 = 2 is the number of
knowing the value of one gives no information about the other. pairs of bikes such that you see both of those bikes in that hour.
Discrete r.v.s X and Y are independent if for all values of x and y What’s the probability that a CRV is in an interval? Take the
di↵erence in CDF values (or use the PDF as described later). What’s the point? You don’t need to know the PMF/PDF of g(X)
P (X = x, Y = y) = P (X = x)P (Y = y) to find its expected value. All you need is the PMF/PDF of X.
P (a X b) = P (X b) P (X a) = FX (b) FX (a)
Expected Value and Indicators 2
For X ⇠ N (µ, ), this becomes Universality of Uniform (UoU)
Expected Value and Linearity b µ a µ When you plug any CRV into its own CDF, you get a Uniform(0,1)
P (a X b) = random variable. When you plug a Uniform(0,1) r.v. into an inverse
Expected Value (a.k.a. mean, expectation, or average) is a weighted
✓ ◆ ✓ ◆
CDF, you get an r.v. with that CDF. For example, let’s say that a
average of the possible outcomes of our random variable.
random variable X has CDF
Mathematically, if x1 , x2 , x3 , . . . are all of the distinct possible values What is the Probability Density Function (PDF)? The PDF f
that X can take, the expected value of X is is the derivative of the CDF F . x
F (x) = 1 e , for x > 0
E(X) = xi P (X = xi ) 0
F (x) = f (x)
i By UoU, if we plug X into this function then we get a uniformly
P
1 0 1
0.30
5 9 14
0.8
4 1 5
0.20
0.6
n n n 0.10
1 1 1
n∑ xi + n∑ yi = n∑ (xi + yi)
0.2
0.00
Mean E(X) = µ1
E(g(X)) = E(g(Y )) How do I find the expected value of a CRV? Analogous to the
discrete case, where you sum x times the PMF, for CRVs you integrate Variance Var(X) = µ2 µ21
Conditional Expected Value is defined like expectation, only
x times the PDF.
conditioned on any event A. 1
Skewness Skew(X) = m3
E(X) = xf (x)dx
E(X|A) = xP (X = x|A)
Z
1
x Kurtosis Kurt(X) = m4 3
P
85
Moment Generating Functions Marginal Distributions Covariance Properties For random variables W, X, Y, Z and
constants a, b:
MGF For any random variable X, the function To find the distribution of one (or more) random variables from a joint
tX PMF/PDF, sum/integrate over the unwanted random variables.
MX (t) = E(e ) Cov(X, Y ) = Cov(Y, X)
is the moment generating function (MGF) of X, if it exists for all Marginal PMF from joint PMF Cov(X + a, Y + b) = Cov(X, Y )
t in some open interval containing 0. The variable t could just as well Cov(aX, bY ) = abCov(X, Y )
P (X = x) = P (X = x, Y = y)
have been called u or v. It’s a bookkeeping device that lets us work
y Cov(W + X, Y + Z) = Cov(W, Y ) + Cov(W, Z) + Cov(X, Y )
X
1
k (k)
µk = E(X ) = MX (0) constants a, b, c, d with a and c nonzero,
Independence of Random Variables
This is true by Taylor expansion of etX since Corr(aX + b, cY + d) = Corr(X, Y )
Random variables X and Y are independent if and only if any of the
1 k k 1 k
tX E(X )t µk t following conditions holds:
MX (t) = E(e )= =
k=0
k! k=0
k! • Joint CDF is the product of the marginal CDFs Transformations
X X
1 1
Joint Distributions
Covariance and Correlation
The joint CDF of X and Y is
Covariance is the analog of variance for two random variables. be the Jacobian matrix. If the entries in this matrix exist and are
F (x, y) = P (X x, Y y) continuous, and the determinant of the matrix is never 0, then
Cov(X, Y ) = E ((X E(X))(Y E(Y ))) = E(XY ) E(X)E(Y )
In the discrete case, X and Y have a joint PMF
Note that @(u, v)
pX,Y (x, y) = P (X = x, Y = y). 2 2 fX,Y (x, y) = fU,V (u, v)
Cov(X, X) = E(X ) (E(X)) = Var(X) @(x, y)
In the continuous case, they have a joint PDF
Correlation is a standardized version of covariance that is always
@2 The inner bars tells us to take the matrix’s determinant, and the outer
between 1 and 1.
fX,Y (x, y) = FX,Y (x, y). bars tell us to take the absolute value. In a 2 ⇥ 2 matrix,
@x@y Cov(X, Y )
The joint PMF/PDF must be nonnegative and sum/integrate to 1. Var(X)Var(Y ) a b
bc|
c d
= |ad
Corr(X, Y ) = p
Var(X1 + X2 + · · · + Xn ) = 1
P (X = x) P (X = x) i=1 i<j
X X
Conditioning and Bayes’ rule for continuous r.v.s If X and Y are independent then they have covariance 0, so Example Let X, Y ⇠ N (0, 1) be i.i.d. Then for each fixed t,
fX,Y (x, y) fX|Y (x|y)fY (y)
= X ?
? Y =) Var(X + Y ) = Var(X) + Var(Y ) 1 1 1
x2 /2 (t x)2 /2
fY |X (y|x) =
fX (x) fX (x) fX+Y (t) = dx
If X1 , X2 , . . . , Xn are identically distributed and have the same p e
2⇡
p e
2⇡
Hybrid Bayes’ rule
Z
1
covariance relationships (often by symmetry), then
P (A|X = x)fX (x) By completing the square and using the fact that a Normal PDF
fX (x|A) = Cov(X1 , X2 )
P (A)
Var(X1 + X2 + · · · + Xn ) = nVar(X1 ) + 2
2 integrates to 1, this works out to fX+Y (t) being the N (0, 2) PDF.
⇣ n⌘
86
Poisson Process • Let T ⇠ Expo(1/10) be how long you have to wait until the Central Limit Theorem (CLT)
shuttle comes. Given that you have already waited t minutes,
the expected additional waiting time is 10 more minutes, by the Approximation using CLT
Definition We have a Poisson process of rate arrivals per unit
time if the following conditions hold: ˙ to denote is approximately distributed. We can use the
memoryless property. That is, E(T |T > t) = t + 10.
We use ⇠
Central Limit Theorem to approximate the distribution of a random
1. The number of arrivals in a time interval of length t is Pois( t). Discrete Y Continuous Y 2
variable Y = X1 + X2 + · · · + Xn that is a sum of n i.i.d. random
variables Xi . Let E(Y ) = µY and Var(Y ) = Y . The CLT says
2. Numbers of arrivals in disjoint time intervals are independent.
E(Y ) = y yP (Y = y) E(Y ) = 1
yfY (y)dy 2
For example, the numbers of arrivals in the time intervals [0, 5], Y ⇠
˙ N (µY , Y )
P R1
y yP (Y = y|A) yf (y|A)dy
(5, 12), and [13, 23) are independent with Pois(5 ), Pois(7 ), Pois(10 ) E(Y |A) = E(Y |A) = 1 2
If the Xi are i.i.d. with mean µX and variance X , then µY = nµX
R1
distributions, respectively.
P
2 2
and Y =n X . For the sample mean X̄n , the CLT says
+ + + + +
1
Conditioning on a Random Variable We can also find E(Y |X),
the expected value of Y given the random variable X. This is a 2
0 T1 T2 T3 T4 T5 X̄n = (X1 + X2 + · · · + Xn ) ⇠
˙ N (µX , X /n)
function of the random variable X. It is not a number except in n
Count-Time Duality Consider a Poisson process of emails arriving certain special cases such as if X ?
? Y . To find E(Y |X), find
in an inbox at rate emails per hour. Let Tn be the time of arrival of E(Y |X = x) and then plug in X for x. For example: Asymptotic Distributions using CLT
the nth email (relative to some starting time 0) and Nt be the number D
of emails that arrive in [0, t]. Let’s find the distribution of T1 . The • If E(Y |X = x) = x3 + 5x, then E(Y |X) = X 3 + 5X. We use ! to denote converges in distribution to as n ! 1. The
event T1 > t, the event that you have to wait more than t hours to get CLT says that if we standardize the sum X1 + · · · + Xn then the
the first email, is the same as the event Nt = 0, which is the event that • Let Y be the number of successes in 10 independent Bernoulli distribution of the sum converges to N (0, 1) as n ! 1:
there are no emails in the first t hours. So trials with probability p of success and X be the number of
1 D
t t
successes among the first 3 trials. Then E(Y |X) = X + 7p. nµX ) ! N (0, 1)
e
p (X1 + · · · + Xn
P (T1 > t) = P (Nt = 0) = e ! P (T1 t) = 1
2 2
n
• Let X ⇠ N (0, 1) and Y = X . Then E(Y |X = x) = x since if
Thus we have T1 ⇠ Expo( ). By the memoryless property and similar we know X = x then we know Y = x2 . And E(X|Y = y) = 0 In other words, the CDF of the left-hand side goes to the standard
reasoning, the interarrival times between emails are i.i.d. Expo( ), i.e.,
p
since if we know Y = y then we know X = ± y, with equal Normal CDF, . In terms of the sample mean, the CLT says
the di↵erences Tn Tn 1 are i.i.d. Expo( ). p
n(X̄n µX ) D
probabilities (by symmetry). So E(Y |X) = X 2 , E(X|Y ) = 0.
! N (0, 1)
Properties of Conditional Expectation X
Order Statistics
1. E(Y |X) = E(Y ) if X ?
?Y Markov Chains
Definition Let’s say you have n i.i.d. r.v.s X1 , X2 , . . . , Xn . If you
arrange them from smallest to largest, the ith element in that list is 2. E(h(X)W |X) = h(X)E(W |X) (taking out what’s known)
the ith order statistic, denoted X(i) . So X(1) is the smallest in the list In particular, E(h(X)|X) = h(X). Definition
and X(n) is the largest in the list.
3. E(E(Y |X)) = E(Y ) (Adam’s Law, a.k.a. Law of Total 5/12 7/12 7/8
Note that the order statistics are dependent, e.g., learning X(4) = 42 Expectation) 1 1/2 1/3 1/4
1 2 3 4 5
gives us the information that X(1) , X(2) , X(3) are 42 and 1/2 1/4 1/6 1/8
X(5) , X(6) , . . . , X(n) are 42. Adam’s Law (a.k.a. Law of Total Expectation) can also be
Distribution Taking n i.i.d. random variables X1 , X2 , . . . , Xn with written in a way that looks analogous to LOTP. For any events A Markov chain is a random walk in a state space, which we will
CDF F (x) and PDF f (x), the CDF and PDF of X(i) are: A1 , A2 , . . . , An that partition the sample space, assume is finite, say {1, 2, . . . , M }. We let Xt denote which element of
the state space the walk is visiting at time t. The Markov chain is the
n sequence of random variables tracking where the walk is at all points
k n k
E(Y ) = E(Y |A1 )P (A1 ) + · · · + E(Y |An )P (An )
FX(i) (x) = P (X(i) x) = F (x) (1 F (x)) in time, X0 , X1 , X2 , . . . . By definition, a Markov chain must satisfy
k=i
k For the special case where the partition is A, Ac , this says the Markov property, which says that if you want to predict where
n ⇣ ⌘
X
the chain will be at a future time, if we know the present state then
c c
i 1 n i the entire past history is irrelevant. Given the present, the past and
fX(i) (x) = n F (x) (1 F (x)) f (x) E(Y ) = E(Y |A)P (A) + E(Y |A )P (A )
i 1 future are conditionally independent. In symbols,
⇣n 1⌘
Uniform Order Statistics The jth order statistic of Eve’s Law (a.k.a. Law of Total Variance) P (Xn+1 = j|X0 = i0 , X1 = i1 , . . . , Xn = i) = P (Xn+1 = j|Xn = i)
i.i.d. U1 , . . . , Un ⇠ Unif(0, 1) is U(j) ⇠ Beta(j, n j + 1).
Var(Y ) = E(Var(Y |X)) + Var(E(Y |X))
State Properties
Conditional Expectation A state is either recurrent or transient.
MVN, LLN, CLT • If you start at a recurrent state, then you will always return
Conditioning on an Event We can find E(Y |A), the expected value back to that state at some point in the future. ♪You can
of Y given that event A occurred. A very important case is when A is check-out any time you like, but you can never leave. ♪
the event X = x. Note that E(Y |A) is a number. For example: Law of Large Numbers (LLN) • Otherwise you are at a transient state. There is some positive
• The expected value of a fair die roll, given that it is prime, is Let X1 , X2 , X3 . . . be i.i.d. with mean µ. The sample mean is probability that once you leave you will never return. ♪You
1 1 1 10 don’t have to go home, but you can’t stay here. ♪
3 ·2+ 3 ·3+ 3 ·5 = 3 .
M ⇥ M matrix where element qij is the probability that the chain goes Uniform Distribution
0.2
from state i to state j in one step:
0.10
Let us say that U is distributed Unif(a, b). We know the following:
PDF
PDF
Properties of the Uniform For a Uniform distribution, the
0.1
qij = P (Xn+1 = j|Xn = i)
0.05
probability of a draw from any interval within the support is
To find the probability that the chain goes from state i to state j in proportional to the length of the interval. See Universality of Uniform
0.0
0.00
and Order Statistics for other properties. 0 5 10 15 20 0 5 10 15 20
exactly m steps, take the (i, j) element of Qm . x x
Example William throws darts really badly, so his darts are uniform Gamma(10, 1) Gamma(5, 0.5)
(m)
0.10
qij = P (Xn+m = j|Xn = i) over the whole room because they’re equally likely to appear anywhere.
William’s darts have a Uniform distribution on the surface of the
0.10
~, i.e.,
If X0 is distributed according to the row vector PMF p room. The Uniform is the only distribution where the probability of
0.05
PDF
PDF
pj = P (X0 = j), then the PMF of Xn is p ~Qn . hitting in any specific region is proportional to the length/area/volume
0.05
of that region, and where the density of occurrence in any one specific
spot is constant throughout the whole support.
Chain Properties
0.00
0.00
0 5 10 15 20 0 5 10 15 20
x x
A chain is irreducible if you can get from anywhere to anywhere. If a Normal Distribution
chain (on a finite state space) is irreducible, then all of its states are Let us say that X is distributed Gamma(a, ). We know the following:
recurrent. A chain is periodic if any of its states are periodic, and is
Let us say that X is distributed N (µ, 2 ). We know the following:
aperiodic if none of its states are periodic. In an irreducible chain, all Central Limit Theorem The Normal distribution is ubiquitous Story You sit waiting for shooting stars, where the waiting time for a
states have the same period. because of the Central Limit Theorem, which states that the sample star is distributed Expo( ). You want to see n shooting stars before
mean of i.i.d. r.v.s will approach a Normal distribution as the sample you go home. The total waiting time for the nth shooting star is
A chain is reversible with respect to ~ s if si qij = sj qji for all i, j. size grows, regardless of the initial distribution. Gamma(n, ).
Examples of reversible chains include any chain with qij = qji , with Location-Scale Transformation Every time we shift a Normal Example You are at a bank, and there are 3 people ahead of you.
1 1 1
~
s = ( M , M , . . . , M ), and random walk on an undirected network. r.v. (by adding a constant) or rescale a Normal (by multiplying by a The serving time for each person is Exponential with mean 2 minutes.
constant), we change it to another Normal r.v. For any Normal Only one person at a time can be served. The distribution of your
Stationary Distribution X ⇠ N (µ, 2 ), we can transform it to the standard N (0, 1) by the waiting time until it’s your turn to be served is Gamma(3, 12 ).
following transformation:
Let us say that the vector ~s = (s1 , s2 , . . . , sM ) be a PMF (written as a X µ Beta Distribution
row vector). We will call ~
s the stationary distribution for the chain Z= ⇠ N (0, 1)
Beta(0.5, 0.5) Beta(2, 1)
if ~
sQ = ~s. As a consequence, if Xt has the stationary distribution,
5
2.0
then all future Xt+1 , Xt+2 , . . . also have the stationary distribution.
4
PDF
PDF
To find the stationary distribution, you can solve the matrix equation
Story You’re sitting on an open meadow right before the break of 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
(Q0 I)~ s 0 = 0. The stationary distribution is uniform if the columns x x
of Q sum to 1. dawn, wishing that airplanes in the night sky were shooting stars,
Beta(2, 8) Beta(5, 5)
because you could really use a wish right now. You know that shooting
2.5
Reversibility Condition Implies Stationarity If you have a PMF ~ s stars come on average every 15 minutes, but a shooting star is not
3
2.0
and a Markov chain with transition matrix Q, then si qij = sj qji for “due” to come just because you’ve waited so long. Your waiting time
1.5
Random Walk on an Undirected Network Example The waiting time until the next shooting star is distributed
0
0.0
Expo(4) hours. Here = 4 is the rate parameter, since shooting 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
x x
1
stars arrive at a rate of 1 per 1/4 hour on average. The expected time
until the next shooting star is 1/ = 1/4 hour.
2
Expos as a rescaled Expo(1) Conjugate Prior of the Binomial In the Bayesian approach to
Y ⇠ Expo( ) ! X = Y ⇠ Expo(1) statistics, parameters are viewed as random variables, to reflect our
3 uncertainty. The prior for a parameter is its distribution before
Memorylessness The Exponential Distribution is the only observing data. The posterior is the distribution for the parameter
4
continuous memoryless distribution. The memoryless property says after observing data. Beta is the conjugate prior of the Binomial
that for X ⇠ Expo( ) and any positive numbers s and t, because if you have a Beta-distributed prior on p in a Binomial, then
P (X > s + t|X > s) = P (X > t) the posterior distribution on p given the Binomial data is also
5
Beta-distributed. Consider the following two-level model:
Equivalently,
X a|(X > a) ⇠ Expo( ) X|p ⇠ Bin(n, p)
If you have a collection of nodes, pairs of which can be connected by For example, a product with an Expo( ) lifetime is always “as good as
undirected edges, and a Markov chain is run by going from the
p ⇠ Beta(a, b)
new” (it doesn’t experience wear and tear). Given that the product has
current node to a uniformly random node that is connected to it by an survived a years, the additional time that it will last is still Expo( ). Then after observing X = x, we get the posterior distribution
edge, then this is a random walk on an undirected network. The then
i ), x)
stationary distribution of this chain is proportional to the degree
Min of Expos If we have independent Xi ⇠ Expo( p|(X = x) ⇠ Beta(a + x, b + n
sequence (this is the sequence of degrees, where the degree of a node
min(X1 , . . . , Xk ) ⇠ Expo( 1 + 2 + · · · + k ).
Order statistics of the Uniform See Order Statistics.
is how many edges are attached to it). For example, the stationary Max of Expos If we have i.i.d. Xi ⇠ Expo( ), then
distribution of random walk on the network shown above is max(X1 , . . . , Xk ) has the same distribution as Y1 + Y2 + · · · + Yk ,
3 3 3 4 2
Beta-Gamma relationship If X ⇠ Gamma(a, ),
proportional to (3, 3, 2, 4, 2), so it’s ( 14 , 14 , 14 , 14 , 14 ). where Yj ⇠ Expo(j ) and the Yj are independent. Y ⇠ Gamma(b, ), with X ?
? Y then
88
X
• X+Y ⇠ Beta(a, b) • Conditional X|(X + Y = r) ⇠ HGeom(n, m, r) Properties Let X ⇠ Pois( 1) and Y ⇠ Pois( 2 ), with X ?
? Y.
X • Binomial-Poisson Relationship Bin(n, p) is approximately
• X +Y ?
? X+Y
1. Sum X + Y ⇠ Pois( 1 + 2)
Pois( ) if p is small.
1
This is known as the bank–post office result. • Binomial-Normal Relationship Bin(n, p) is approximately
2. Conditional X|(X + Y = n) ⇠ Bin n,
1+ 2
⇣ ⌘
2 N (np, np(1 p)) if n is large and p is not near 0 or 1. 3. Chicken-egg If there are Z ⇠ Pois( ) items and we randomly
(Chi-Square) Distribution and independently “accept” each item with probability p, then
2 Geometric Distribution
Let us say that X is distributed n. We know the following: the number of accepted items Z1 ⇠ Pois( p), and the number of
Story A Chi-Square(n) is the sum of the squares of n independent Let us say that X is distributed Geom(p). We know the following:
rejected items Z2 ⇠ Pois( (1 p)), and Z1 ?? Z2 .
standard Normal r.v.s. Story X is the number of “failures” that we will achieve before we
Properties and Representations achieve our first success. Our successes have probability p. Multivariate Distributions
1
2 2 2 Example If each pokeball we throw has probability 10to catch Mew,
1 Multinomial Distribution
).
X is distributed as Z1 + Z2 + · · · + Zn for i.i.d. Zi ⇠ N (0, 1)
the number of failed pokeballs will be distributed Geom( 10
X ⇠ Gamma(n/2, 1/2) Let us say that the vector X ~)
~ = (X1 , X2 , X3 , . . . , Xk ) ⇠ Multk (n, p
First Success Distribution where p
~ = (p1 , p2 , . . . , pk ).
Discrete Distributions Equivalent to the Geometric distribution, except that it includes the Story We have n items, which can fall into any one of the k buckets
first success in the count. This is 1 more than the number of failures. independently with the probabilities p ~ = (p1 , p2 , . . . , pk ).
If X ⇠ FS(p) then E(X) = 1/p.
Distributions for four sampling schemes Example Let us assume that every year, 100 students in the Harry
Negative Binomial Distribution Potter Universe are randomly and independently sorted into one of
Replace No Replace four houses with equal probability. The number of people in each of the
Let us say that X is distributed NBin(r, p). We know the following: ~), where p
houses is distributed Mult4 (100, p ~ = (0.25, 0.25, 0.25, 0.25).
Fixed # trials (n) Binomial HGeom
(Bern if n = 1) Story X is the number of “failures” that we will have before we Note that X1 + X2 + · · · + X4 = 100, and they are dependent.
Draw until r success NBin NHGeom achieve our rth success. Our successes have probability p. Joint PMF For n = n1 + n2 + · · · + nk ,
(Geom if r = 1) Example Thundershock has 60% accuracy and can faint a wild n! n n n
Raticate in 3 hits. The number of misses before Pikachu faints P (X
~ =~n) = p 1 p 2 . . . pk k
n1 !n2 ! . . . nk ! 1 2
Raticate with Thundershock is distributed NBin(3, 0.6).
Bernoulli Distribution Marginal PMF, Lumping, and Conditionals Marginally,
The Bernoulli distribution is the simplest case of the Binomial Hypergeometric Distribution Xi ⇠ Bin(n, pi ) since we can define “success” to mean category i. If
distribution, where we only have one trial (n = 1). Let us say that X is you lump together multiple categories in a Multinomial, then it is still
Let us say that X is distributed HGeom(w, b, n). We know the
distributed Bern(p). We know the following: following:
Multinomial. For example, Xi + Xj ⇠ Bin(n, pi + pj ) for i 6= j since
we can define “success” to mean being in category i or j. Similarly, if
Story A trial is performed with probability p of “success”, and X is Story In a population of w desired objects and b undesired objects, k = 6 and we lump categories 1-2 and lump categories 3-5, then
the indicator of success: 1 means success, 0 means failure. X is the number of “successes” we will have in a draw of n objects, (X1 + X2 , X3 + X4 + X5 , X6 ) ⇠ Mult3 (n, (p1 + p2 , p3 + p4 + p5 , p6 ))
Example Let X be the indicator of Heads for a fair coin toss. Then without replacement. The draw of n objects is assumed to be a
simple random sample (all sets of n objects are equally likely). Conditioning on some Xj also still gives a Multinomial:
X ⇠ Bern( 12 ). Also, 1 X ⇠ Bern( 12 ) is the indicator of Tails.
Examples Here are some HGeom examples. p1 pk 1
Binomial Distribution X1 , . . . , Xk 1 |Xk = nk ⇠ Multk 1 n nk , ,...,
1 pk 1 pk
✓ ✓ ◆◆
0.30
Var(Xi ) = npi (1 pi ). Also, Cov(Xi , Xj ) = npi pj for i 6= j.
●
0.25
• The number of Aces in a 5 card hand. Multivariate Uniform Distribution
● ●
0.20
• You have w white balls and b black balls, and you draw n balls. See the univariate Uniform for stories and examples. For the 2D
pmf
0.15
You will draw X white balls. Uniform on some region, probability is proportional to area. Every
● ●
0.10
• You have w white balls and b black balls, and you draw n balls point in the support has equal density, of value area of1 region . For the
● ●
0.05
without replacement. The number of white balls in your sample 3D Uniform, probability is proportional to volume.
● ●
● ● is HGeom(w, b, n); the number of black balls is HGeom(b, w, n).
0.00
0 2 4 6 8 10
• Capture-recapture A forest has N elk, you capture n of them, Multivariate Normal (MVN) Distribution
x
tag them, and release them. Then you recapture a new sample A vector X
~ = (X1 , X2 , . . . , Xk ) is Multivariate Normal if every linear
Let us say that X is distributed Bin(n, p). We know the following: of size m. How many tagged elk are now in the new sample? combination is Normally distributed, i.e., t1 X1 + t2 X2 + · · · + tk Xk is
Story X is the number of “successes” that we will achieve in n HGeom(n, N n, m) Normal for any constants t1 , t2 , . . . , tk . The parameters of the
independent trials, where each trial is either a success or a failure, each Multivariate Normal are the mean vector µ ~ = (µ1 , µ2 , . . . , µk ) and
with the same probability p of success. We can also write X as a sum Poisson Distribution the covariance matrix where the (i, j) entry is Cov(Xi , Xj ).
of multiple independent Bern(p) random variables. Let X ⇠ Bin(n, p)
and Xj ⇠ Bern(p), where all of the Bernoullis are independent. Then Let us say that X is distributed Pois( ). We know the following: Properties The Multivariate Normal has the following properties.
Story There are rare events (low probability events) that occur many • Any subvector is also MVN.
X = X1 + X2 + X3 + · · · + Xn di↵erent ways (high possibilities of occurences) at an average rate of
occurrences per unit space or time. The number of events that occur • If any two elements within an MVN are uncorrelated, then they
Example If Jeremy Lin makes 10 free throws and each one in that unit of space or time is X. are independent.
independently has a 34 chance of getting in, then the number of free
Example A certain busy intersection has an average of 2 accidents
• The joint PDF of a Bivariate Normal (X, Y ) with N (0, 1)
throws he makes is distributed Bin(10, 34 ).
per month. Since an accident is a low probability event that can
marginal distributions and correlation ⇢ 2 ( 1, 1) is
Properties Let X ⇠ Bin(n, p), Y ⇠ Bin(m, p) with X ?
? Y. happen many di↵erent ways, it is reasonable to model the number of 1 1 2 2
fX,Y (x, y) = exp 2
(x + y 2⇢xy) ,
accidents in a month at that intersection as Pois(2). Then the number 2⇡⌧ 2⌧
• Redefine success n p)
✓ ◆
A convolution of n random variables is simply their sum. For the N3 ⇠ FS((n 2)/n), and Nj ⇠ FS((n j + 1)/n). By linearity,
following results, let X and Y be independent.
Miscellaneous Definitions n
+ n n n
1. X ⇠ Pois( 1 ), Y ⇠ Pois( 2) ! X + Y ⇠ Pois( 1 2) E(N ) = E(N1 ) + · · · + E(Nn ) = + + ··· + = n
n n 1 1 j=1
j
X1
3. Gamma(1, ) ⇠ Expo( )
2 n 1
Calculating Probability
4. n ⇠ Gamma 2, 2 Expectation of Negative Hypergeometric
A textbook has n typos, which are randomly scattered amongst its n
What is the expected number of cards that you draw before you pick
5. NBin(1, p) ⇠ Geom(p)
pages, independently. You pick a random page. What is the
1 your first Ace in a shu✏ed deck (not counting the Ace)? Answer:
Inequalities probability that it has no typos? Answer: There is a 1 n
probability that any specific typo isn’t on your page, and thus a Consider a non-Ace. Denote this to be card j. Let Ij be the indicator
E(X 2 )E(Y 2 ) that card j will be drawn before the first Ace. Note that Ij = 1 says
1 n
1. Cauchy-Schwarz |E(XY )|
1 probability that there are no typos on your page. For n that j is before all 4 of the Aces in the deck. The probability that this
E|X|
p
2. Markov P (X a for a > 0 n occurs is 1/5 by symmetry. Let X be the number of cards drawn
✓ ◆
a)
2 2 1 before the first Ace. Then X = I1 + I2 + ... + I48 , where each indicator
3. Chebyshev P (|X µ| for E(X) = µ, Var(X) = large, this is approximately e = 1/e.
a) a2 corresponds to one of the 48 non-Aces. Thus,
4. Jensen E(g(X)) g(E(X)) for g convex; reverse if g is E(X) = E(I1 ) + E(I2 ) + ... + E(I48 ) = 48/5 = 9.6 .
concave Linearity and Indicators (1)
Minimum and Maximum of RVs
Formulas In a group of n people, what is the expected number of distinct
birthdays (month and day)? What is the expected number of birthday What is the CDF of the maximum of n independent Unif(0,1) random
matches? Answer: Let X be the number of distinct birthdays and Ij variables? Answer: Note that for r.v.s X1 , X2 , . . . , Xn ,
Geometric Series be the indicator for the jth day being represented. P (min(X1 , X2 , . . . , Xn ) a) = P (X1 a, X2 a, . . . , Xn a)
n n n
Similarly,
2 n 1 k 1 r E(Ij ) = 1 P (no one born on day j) = 1 (364/365)
1 + r + r + ··· + r = r =
1 r
P (max(X1 , X2 , . . . , Xn ) a) = P (X1 a, X2 a, . . . , Xn a)
k=0
X1
n!1
+ · · · = lim
n=0
n! 2! 3! n
✓ ◆
1).
X
t 1 x a 1 b 1 (a) (b)
x e dx = (t) x (1 x) dx = people who leave with the right hat? Answer: Each hat has a 1/n 1 k 1 k+1
0 0 (a + b) chance of going to the right person. By linearity, the average number 1 1 e e e
E = = = (e 1)
Z 1 Z 1
X +1 k=0
k+1 k! k=0
(k + 1)!
✓ ◆
Also, (a + 1) = a (a), and (n) = (n 1)! if n is a positive integer. of hats that go to their owners is n(1/n) = 1 .
X X
90
Adam’s Law and Eve’s Law 4. Calculating expectation. If it has a named distribution,
check out the table of distributions. If it’s a function of an r.v.
William really likes speedsolving Rubik’s Cubes. But he’s pretty bad with a named distribution, try LOTUS. If it’s a count of
at it, so sometimes he fails. On any given day, William will attempt something, try breaking it up into indicator r.v.s. If you can
N ⇠ Geom(s) Rubik’s Cubes. Suppose each time, he has probability p condition on something natural, consider using Adam’s law.
of solving the cube, independently. Let T be the number of Rubik’s
Cubes he solves during a day. Find the mean and variance of T . 5. Calculating variance. Consider independence, named
Answer: Note that T |N ⇠ Bin(N, p). So by Adam’s Law, Robber distributions, and LOTUS. If it’s a count of something, break it
up into a sum of indicator r.v.s. If it’s a sum, use properties of
p(1 s) covariance. If you can condition on something natural, consider
E(T ) = E(E(T |N )) = E(N p) = using Eve’s Law.
s
6. Calculating E(X 2 ). Do you already know E(X) or Var(X)?
Similarly, by Eve’s Law, we have that Recall that Var(X) = E(X 2 ) (E(X))2 . Otherwise try
LOTUS.
Var(T ) = E(Var(T |N )) + Var(E(T |N )) = E(N p(1 p)) + Var(N p)
7. Calculating covariance. Use the properties of covariance. If
2
(a) Is this Markov chain irreducible? Is it aperiodic? Answer:
p(1 p)(1 s) p (1 s) p(1 s)(p + s(1 p)) you’re trying to find the covariance between two components of
= + = Yes to both. The Markov chain is irreducible because it can a Multinomial distribution, Xi , Xj , then the covariance is
s s2 s2
get from anywhere to anywhere else. The Markov chain is npi pj for i 6= j.
aperiodic because the robber can return back to a square in
8. Symmetry. If X1 , . . . , Xn are i.i.d., consider using symmetry.
MGF – Finding Moments 2, 3, 4, 5, . . . moves, and the GCD of those numbers is 1.
3 (b) What is the stationary distribution of this Markov chain? 9. Calculating probabilities of orderings. Remember that all
Find E(X ) for X ⇠ Expo( ) using the MGF of X. Answer: The
Answer: Since this is a random walk on an undirected graph, n! ordering of i.i.d. continuous random variables X1 , . . . , Xn
MGF of an Expo( ) is M (t) = t . To get the third moment, we can are equally likely.
the stationary distribution is proportional to the degree
take the third derivative of the MGF and evaluate at t = 0: sequence. The degree for the corner pieces is 3, the degree for 10. Determining independence. There are several equivalent
the edge pieces is 4, and the degree for the center pieces is 6. definitions. Think about simple and extreme cases to see if you
3 6 To normalize this degree sequence, we divide by its sum. The
E(X ) = 3
can find a counterexample.
sum of the degrees is 6(3) + 6(4) + 7(6) = 84. Thus the
stationary probability of being on a corner is 3/84 = 1/28, on 11. Do a painful integral. If your integral looks painful, see if
But a much nicer way to use the MGF here is via pattern recognition: an edge is 4/84 = 1/21, and in the center is 6/84 = 1/14. you can write your integral in terms of a known PDF (like
note that M (t) looks like it came from a geometric series: Gamma or Beta), and use the fact that PDFs integrate to 1?
(c) What fraction of the time will the robber be in the center tile
1 12. Before moving on. Check some simple and extreme cases,
1 t n! tn in this game, in the long run? Answer: By the above, 1/14 .
= = check whether the answer seems plausible, check for biohazards.
t
1 n=0 n=0
n n!
1 ✓ ◆n
(d) What is the expected amount of moves it will take for the
X X
robber to return to the center tile? Answer: Since this chain is Biohazards
tn
The coefficient of n!here is the nth moment of X, so we have irreducible and aperiodic, to get the expected time to return we
E(X n ) = n!n for all nonnegative integers n. can just invert the stationary probability. Thus on average it
Contributions from Jessy Hwang
will take 14 turns for the robber to return to the center tile.
Markov chains (1) 1. Don’t misuse the naive definition of probability. When
Problem-Solving Strategies answering “What is the probability that in a group of 3 people,
Suppose Xn is a two-state Markov chain with transition matrix no two have the same birth month?”, it is not correct to treat
the people as indistinguishable balls being placed into 12 boxes,
0 1 Contributions from Jessy Hwang, Yuan Jiang, Yuqi Hou since that assumes the list of birth months {January, January,
0 1 ↵ ↵ 1. Getting started. Start by defining relevant events and
Q= January} is just as likely as the list {January, April, June},
1 1 random variables. (“Let A be the event that I pick the fair even though the latter is six times more likely.
✓ ◆
2. Calculating probability of an event. Use counting story and dependent in the Hypergeometric story.
principles if the naive definition of probability applies. Is the
To show that the chain is reversible with respect to ~ s, we must show 4. Don’t forget to do sanity checks. Probabilities must be
probability of the complement easier to find? Look for
si qij = sj qji for all i, j. This is done if we can show s0 q01 = s1 q10 . between 0 and 1. Variances must be 0. Supports must make
symmetries. Look for something to condition on, then apply
And indeed, sense. PMFs must sum to 1. PDFs must integrate to 1.
↵ Bayes’ Rule or the Law of Total Probability.
s0 q01 = = s1 q10 3. Finding the distribution of a random variable. First make 5. Don’t confuse random variables, numbers, and events.
↵+ Let X be an r.v. Then g(X) is an r.v. for any function g. In
sure you need the full distribution not just the mean (see next
Markov chains (2) item). Check the support of the random variable: what values particular, X 2 , |X|, F (X), and IX>3 are r.v.s.
can it take on? Use this to rule out distributions that don’t fit. P (X 2 < X|X 0), E(X), Var(X), and g(E(X)) are numbers.
William and Sebastian play a modified game of Settlers of Catan, Is there a story for one of the named distributions that fits the 1 are events. It does not make sense to
where every turn they randomly move the robber (which starts on the problem at hand? Can you write the random variable as a write 11 F (X)dx, because F (X) is a random variable. It does
center tile) to one of the adjacent hexagons. function of an r.v. with a known distribution, say Y = g(X)? not make sense to write P (X), because X is not an event.
X = 2R and F (X)
91
6. Don’t confuse a random variable with its distribution. Recommended Resources
To get the PDF of X 2 , you can’t just square the PDF of X.
The right way is to use transformations. To get the PDF of
X + Y , you can’t just add the PDF of X and the PDF of Y .
The right way is to compute the convolution. • Introduction to Probability Book
(http://bit.ly/introprobability)
7. Don’t pull non-linear functions out of expectations. • Stat 110 Online (http://stat110.net)
E(g(X)) does not equal g(E(X)) in general. The St. • Stat 110 Quora Blog (https://stat110.quora.com/)
Petersburg paradox is an extreme example. See also Jensen’s
inequality. The right way to find E(g(X)) is with LOTUS. • Quora Probability FAQ (http://bit.ly/probabilityfaq)
• R Studio (https://www.rstudio.com)
• LaTeX File (github.com/wzchen/probability cheatsheet)
Distributions in R
The table above gives R commands for working with various named
distributions. Commands analogous to pbinom, qbinom, and rbinom
work for the other distributions in the table. For example, pnorm,
qnorm, and rnorm can be used to get the CDF, quantiles, and random
generation for the Normal. For the Multinomial, dmultinom can be used
for calculating the joint PMF and rmultinom can be used for generating
random vectors. For the Multivariate Normal, after installing and
loading the mvtnorm package dmvnorm can be used for calculating the
joint PDF and rmvnorm can be used for generating random vectors.
92
Table of Distributions
Bernoulli P (X = 1) = p
Bern(p) P (X = 0) = q = 1 p p pq q + pet
n k
Binomial P (X = k) = k
pk q n
Bin(n, p) k 2 {0, 1, 2, . . . n} np npq (q + pet )n
Geometric P (X = k) = q k p
p
Geom(p) k 2 {0, 1, 2, . . . } q/p q/p2 1 qet
, qet < 1
r+n 1
Negative Binomial P (X = n) = r 1
pr q n
p
NBin(r, p) n 2 {0, 1, 2, . . . } rq/p rq/p2 (1 qet
)r , qet < 1
w b w+b
Hypergeometric P (X = k) = k n k
/ n
⇣ ⌘⇣ ⌘ ⇣ ⌘
nw w+b n µ µ
HGeom(w, b, n) k 2 {0, 1, 2, . . . , n} µ= b+w w+b 1
nn (1 n
) messy
⇣ ⌘
k
e
Poisson P (X = k) = k!
Pois( ) k 2 {0, 1, 2, . . . } e (et 1)
1
Uniform f (x) = b a
a+b (b a)2 etb eta
Unif(a, b) x 2 (a, b) 2 12 t(b a)
2)
f (x) = p1 e (x µ)2/(2
Normal 2⇡ 2 t2
N (µ, 2 ) x 2 ( 1, 1) µ 2 etµ+ 2
Exponential f (x) = e x
1 1
Expo( ) x 2 (0, 1) 2 t
,t<
1 x1
f (x) = (a)
( x)a e x
Gamma
a a
Gamma(a, ) x 2 (0, 1) 2 t
,t<
⇣ ⌘a
(a+b) a 1 1
f (x) = (a) (b)
x (1 x)b
Beta
a µ(1 µ)
Beta(a, b) x 2 (0, 1) µ= a+b (a+b+1)
messy
1
Chi-Square xn/2 1 e x/2
2n/2 (n/2)
2 n 2n (1 2t) n/2 , t < 1/2
n x 2 (0, 1)
((n+1)/2) (n+1)/2
p
n⇡ (n/2)
(1 + x2 /n)
Student-t
n
tn x 2 ( 1, 1) 0 if n > 1 n 2
if n > 2 doesn’t exist
93
CS 229 – Machine Learning https://stanford.edu/~shervine
VIP Refresher: Linear Algebra and Calculus • outer product: for x œ Rm , y œ Rn , we have:
··· x1 yn
Afshine Amidi and Shervine Amidi xy T = .. ..
. .
œ Rm◊n
xm y 1 xm y n
3 x1 y1 4
···
October 6, 2018
r Matrix-vector multiplication – The product of matrix A œ Rm◊n and vector x œ Rn is a
vector of size Rm , such that:
General notations
aT
r,1 x n
r Vector – We note x œ Rn a vector with n entries, where xi œ R is the ith entry:
..
.
ac,i xi œ Rm
x2
Q R
T
ar,m x i=1
x= ..
ÿ
œ Rn
.
A x1 B Ax = a b=
xn
where aT
r,i are the vector rows and ac,j are the vector columns of A, and xi are the entries
r Matrix – We note A œ Rm◊n a matrix with m rows and n columns, where Ai,j œ R is the of x.
entry located in the ith row and j th column: r Matrix-matrix multiplication – The product of matrices A œ Rm◊n and B œ Rn◊p is a
1,1 · · · A1,n matrix of size Rn◊p , such that:
A= .. ..
. . aT aT
œ Rm◊n
r,1 bc,1 r,1 bc,p n
A A B
···
n◊p
Am,1 · · · Am,n
.. .. ac,i bT
. . r,i œ R
Q R
T T i=1
Remark: the vector x defined above can be viewed as a n ◊ 1 matrix and is more particularly
called a column-vector. ar,m bc,1 ar,m bc,p
ÿ
···
AB = a b=
T
r Identity matrix – The identity matrix I œ Rn◊n is a square matrix with ones in its diagonal
and zero everywhere else: where aT
r,i , br,i are the vector rows and ac,j , bc,j are the vector columns of A and B respec-
tively.
.. .. .
. r Transpose – The transpose of a matrix A œ Rm◊n , noted AT , is such that its entries are
flipped:
Q 1 0 ··· 0 R
.. . . ..
. . . 0
0 1 AT
i,j = Aj,i
c . .. d
0 ··· ’i,j,
I=a 0 b
. .. ..
.. . . 0 Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)≠1 =
0 0 dn
c d
···
B ≠1 A≠1
D=a 0 b
Remark: for matrices A,B, we have tr(AT ) = tr(A) and tr(AB) = tr(BA)
• inner product: for x,y œ Rn , we have:
n
r Determinant – The determinant of a square matrix A œ Rn◊n , noted |A| or det(A) is
xT y = xi yi œ R expressed recursively in terms of A\i,\j , which is the matrix A without its ith row and j th
i=1 column, as follows:
ÿ
n
A = AT and ’x œ Rn , xT Ax > 0
det(A) = |A| = (≠1)i+j Ai,j |A\i,\j |
j=1 Remark: similarly, a matrix A is said to be positive definite, and is noted A º 0, if it is a PSD
ÿ
r Norm – A norm is a function N : V ≠æ [0, + Œ[ where V is a vector space, and such that unitary, m ◊ n diagonal and V n ◊ n unitary matrices, such that:
for all x,y œ V , we have:
A=U VT
• N (x + y) 6 N (x) + N (y)
Manhattan, L1 ||x||1 |xi | LASSO regularization Remark: the gradient of f is only defined when f is a function that returns a scalar.
i=1 r Hessian – Let f : Rn æ R be a function and x œ Rn be a vector. The hessian of f with
ÿ
Remark: the hessian of f is only defined when f is a function that returns a scalar.
p-norm, Lp ||x||p xpi Hölder inequality r Gradient operations – For matrices A,B,C, the following gradient properties are worth
having in mind:
A n B p1
i=1
ÿ
Uniform convergence
ÒA tr(AB) = B T ÒAT f (A) = (ÒA f (A))T
Infinity, LŒ
i
||x||Œ max |xi |
Limits
Definitions
Precise Definition : We say lim f ( x ) = L if Limit at Infinity : We say lim f ( x ) = L if we
x Æa x Æ•
for every e > 0 there is a d > 0 such that can make f ( x ) as close to L as we want by
whenever 0 < x - a < d then f ( x ) - L < e . taking x large enough and positive.
Properties
Assume lim f ( x ) and lim g ( x ) both exist and c is any number then,
x Æa x Æa
Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
97
Calculus Cheat Sheet
Evaluation Techniques
Continuous Functions L’Hospital’s Rule
If f ( x ) is continuous at a then lim f ( x ) = f ( a ) f ( x) 0 f ( x) ± •
x Æa If lim = or lim = then,
x Æa g ( x ) 0 x Æa g ( x ) ±•
Continuous Functions and Composition f ( x) f ¢( x)
f ( x ) is continuous at b and lim g ( x ) = b then lim = lim a is a number, • or -•
x Æa g ( x ) x Æa g ¢ ( x )
x Æa
x Æa
( x Æa
)
lim f ( g ( x ) ) = f lim g ( x ) = f ( b ) Polynomials at Infinity
p ( x ) and q ( x ) are polynomials. To compute
Factor and Cancel
p ( x)
lim
x 2 + 4 x - 12
= lim
( x - 2 )( x + 6 ) lim
x Ʊ • q ( x )
factor largest power of x in q ( x ) out
xÆ2 x - 2x
2 xÆ2 x ( x - 2)
of both p ( x ) and q ( x ) then compute limit.
x+6 8
= lim
xÆ2 x
= =4
Rationalize Numerator/Denominator
2
lim
3x 2 - 4
= lim 2 5
( )
x 2 3 - 42
x
3 - 42
= lim 5 x = -
3
3- x 3- x 3+ x
x Æ-• 5 x - 2 x 2 x Æ-• x
( )
x -2
x Æ- •
x -2 2
lim 2 = lim 2 Piecewise Function
x Æ9 x - 81 x Æ9 x - 81 3 + x
Ï x 2 + 5 if x < -2
= lim 2
9- x
= lim
-1 lim g ( x ) where g ( x ) = Ì
( )
( x - 81) 3 + x xÆ9 ( x + 9 ) 3 + x ( ) Ó1 - 3x if x ≥ -2
x Æ-2
x Æ9
lim g ( x ) = lim+ 1 - 3 x = 7
Combine Rational Expressions x Æ-2+ x Æ-2
Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
98
Calculus Cheat Sheet
Derivatives
Definition and Notation
f ( x + h) - f ( x)
If y = f ( x ) then the derivative is defined to be f ¢ ( x ) = lim .
h Æ0 h
If y = f ( x ) then all of the following are If y = f ( x ) all of the following are equivalent
equivalent notations for the derivative. notations for derivative evaluated at x = a .
df dy d df dy
f ¢ ( x ) = y¢ = = = ( f ( x ) ) = Df ( x ) f ¢ ( a ) = y ¢ x =a = = = Df ( a )
dx dx dx dx x =a dx x =a
Common Derivatives
d d d x
dx
( x) = 1
dx
( csc x ) = - csc x cot x
dx
( a ) = a x ln ( a )
d d d x
dx
( sin x ) = cos x
dx
( cot x ) = - csc2 x
dx
( e ) = ex
d d 1 d 1
dx
( cos x ) = - sin x
dx
( sin -1 x ) =
dx
( ln ( x ) ) = , x > 0
x
1 - x2
d d 1
dx
( tan x ) = sec2 x d
( cos -1 x ) = -
1
dx
( ln x ) = x , x π 0
dx 1 - x2
d d 1
dx
( sec x ) = sec x tan x d
( tan -1 x ) =
1
dx
( log a ( x ) ) =
x ln a
, x>0
dx 1 + x2
Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
99
Calculus Cheat Sheet
2.
dx
e (
d f ( x)
)
= f ¢ ( x ) e f ( x) 6.
d
dx
( )
tan ÈÎ f ( x ) ˘˚ = f ¢ ( x ) sec 2 ÈÎ f ( x ) ˘˚
f ( x)
¢ d
3.
d
(
ln ÈÎ f ( x ) ˘˚ = ) 7. ( sec [ f ( x)]) = f ¢( x) sec [ f ( x)] tan [ f ( x)]
dx f ( x) dx
d f ¢( x)
4.
d
( )
sin ÈÎ f ( x ) ˘˚ = f ¢ ( x ) cos ÈÎ f ( x ) ˘˚ 8. (
tan -1 ÈÎ f ( x ) ˘˚ = )
1 + ÈÎ f ( x ) ˘˚
2
dx dx
Implicit Differentiation
¢
Find y if e 2 x -9 y
+ x y = sin ( y ) + 11x . Remember y = y ( x ) here, so products/quotients of x and y
3 2
will use the product/quotient rule and derivatives of y will use the chain rule. The “trick” is to
differentiate as normal and every time you differentiate a y you tack on a y¢ (from the chain rule).
After differentiating solve for y¢ .
e 2 x -9 y ( 2 - 9 y¢ ) + 3 x 2 y 2 + 2 x3 y y¢ = cos ( y ) y¢ + 11
11 - 2e 2 x -9 y - 3 x 2 y 2
2e 2 x -9 y
- 9 y¢e 2 x -9 y
+ 3x y + 2 x y y¢ = cos ( y ) y¢ + 11
2 2 3
fi y¢ = 3
2 x y - 9e2 x -9 y - cos ( y )
( 2 x y - 9e x
3 2 -9 y
- cos ( y ) ) y¢ = 11 - 2e2 x -9 y - 3 x 2 y 2
Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
100
Calculus Cheat Sheet
Extrema
Absolute Extrema Relative (local) Extrema
1. x = c is an absolute maximum of f ( x ) 1. x = c is a relative (or local) maximum of
if f ( c ) ≥ f ( x ) for all x in the domain. f ( x ) if f ( c ) ≥ f ( x ) for all x near c.
2. x = c is a relative (or local) minimum of
2. x = c is an absolute minimum of f ( x )
f ( x ) if f ( c ) £ f ( x ) for all x near c.
if f ( c ) £ f ( x ) for all x in the domain.
1st Derivative Test
Fermat’s Theorem If x = c is a critical point of f ( x ) then x = c is
If f ( x ) has a relative (or local) extrema at
1. a rel. max. of f ( x ) if f ¢ ( x ) > 0 to the left
x = c , then x = c is a critical point of f ( x ) .
of x = c and f ¢ ( x ) < 0 to the right of x = c .
Extreme Value Theorem 2. a rel. min. of f ( x ) if f ¢ ( x ) < 0 to the left
If f ( x ) is continuous on the closed interval of x = c and f ¢ ( x ) > 0 to the right of x = c .
[ a, b] then there exist numbers c and d so that, 3. not a relative extrema of f ( x ) if f ¢ ( x ) is
1. a £ c, d £ b , 2. f ( c ) is the abs. max. in the same sign on both sides of x = c .
[ a, b] , 3. f ( d ) is the abs. min. in [ a, b] . 2nd Derivative Test
If x = c is a critical point of f ( x ) such that
Finding Absolute Extrema
To find the absolute extrema of the continuous f ¢ ( c ) = 0 then x = c
function f ( x ) on the interval [ a, b ] use the 1. is a relative maximum of f ( x ) if f ¢¢ ( c ) < 0 .
following process. 2. is a relative minimum of f ( x ) if f ¢¢ ( c ) > 0 .
1. Find all critical points of f ( x ) in [ a, b] . 3. may be a relative maximum, relative
2. Evaluate f ( x ) at all points found in Step 1. minimum, or neither if f ¢¢ ( c ) = 0 .
3. Evaluate f ( a ) and f ( b ) .
4. Identify the abs. max. (largest function Finding Relative Extrema and/or
value) and the abs. min.(smallest function Classify Critical Points
value) from the evaluations in Steps 2 & 3. 1. Find all critical points of f ( x ) .
2. Use the 1st derivative test or the 2nd
derivative test on each critical point.
Newton’s Method
f ( xn )
If xn is the nth guess for the root/solution of f ( x ) = 0 then (n+1)st guess is xn +1 = xn -
f ¢ ( xn )
provided f ¢ ( xn ) exists.
Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
101
Calculus Cheat Sheet
Related Rates
Sketch picture and identify known/unknown quantities. Write down equation relating quantities
and differentiate with respect to t using implicit differentiation (i.e. add on a derivative every time
you differentiate a function of t). Plug in known quantities and solve for the unknown quantity.
Ex. A 15 foot ladder is resting against a wall. Ex. Two people are 50 ft apart when one
The bottom is initially 10 ft away and is being starts walking north. The angle q changes at
pushed towards the wall at 14 ft/sec. How fast 0.01 rad/min. At what rate is the distance
is the top moving after 12 sec? between them changing when q = 0.5 rad?
Optimization
Sketch picture if needed, write down equation to be optimized and constraint. Solve constraint for
one of the two variables and plug into first equation. Find critical points of equation in range of
variables and verify that they are min/max as needed.
Ex. We’re enclosing a rectangular field with Ex. Determine point(s) on y = x 2 + 1 that are
500 ft of fence material and one side of the closest to (0,2).
field is a building. Determine dimensions that
will maximize the enclosed area.
Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
102
Calculus Cheat Sheet
Integrals
Definitions
Definite Integral: Suppose f ( x ) is continuous Anti-Derivative : An anti-derivative of f ( x )
on [ a, b] . Divide [ a, b ] into n subintervals of is a function, F ( x ) , such that F ¢ ( x ) = f ( x ) .
width D x and choose x from each interval. Indefinite Integral : Ú f ( x ) dx = F ( x ) + c
*
i
•
b
f (x )D x . where F ( x ) is an anti-derivative of f ( x ) .
Ú a f ( x ) dx = nlim Â
*
Then i
Æ•
i =1
Part II : f ( x ) is continuous on [ a, b ] , F ( x ) is d u( x)
f ( t ) dt = u ¢ ( x ) f [ u ( x ) ] - v¢ ( x ) f [ v ( x ) ]
dx Ú v( x )
an anti-derivative of f ( x ) (i.e. F ( x ) = Ú f ( x ) dx )
b
then Ú f ( x ) dx = F ( b ) - F ( a ) .
a
Properties
Ú f ( x ) ± g ( x ) dx = Ú f ( x ) dx ± Ú g ( x ) dx Ú cf ( x ) dx = c Ú f ( x ) dx , c is a constant
b b b b b
Ú a f ( x ) ± g ( x ) dx = Ú a f ( x ) dx ± Ú a g ( x ) dx Ú a cf ( x ) dx = c Ú a f ( x ) dx , c is a constant
a b
Ú a f ( x ) dx = 0 Ú c dx = c ( b - a )
a
b a b b
Ú a f ( x ) dx = -Úb f ( x ) dx Ú f ( x ) dx £ Ú f ( x )
a a
dx
b c b
Ú f ( x ) dx = Ú f ( x ) dx + Ú f ( x ) dx for any value of c.
a a c
b b
If f ( x ) ≥ g ( x ) on a £ x £ b then Ú f ( x ) dx ≥ Ú g ( x ) dx
a a
b
If f ( x ) ≥ 0 on a £ x £ b then Ú f ( x ) dx ≥ 0
a
b
If m £ f ( x ) £ M on a £ x £ b then m ( b - a ) £ Ú f ( x ) dx £ M ( b - a )
a
Common Integrals
Ú k dx = k x + c Ú cos u du = sin u + c Ú tan u du = ln sec u + c
Ú x dx = n+1 x + c, n π -1 Ú sin u du = - cos u + c Ú sec u du = ln sec u + tan u + c
n 1 n +1
Ú a x + b dx = a ln ax + b + c
1 1
Ú sec u tan u du = sec u + c Ú a - u du = sin ( a ) + c
1
2 2
u -1
Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
103
Calculus Cheat Sheet
( )
Ú a f ( g ( x ) ) g ¢ ( x ) dx = Ú g (a ) f ( u ) du
b g b
u Substitution : The substitution u = g ( x ) will convert using
cos ( x3 ) dx cos ( x3 ) dx = Ú
2 2 8
Ex. Ú 1 5x
2
Ú 1 5x
2 5
1 3
cos ( u ) du
u = x 3 fi du = 3 x 2 dx fi x 2 dx = 13 du ( sin (8) - sin (1) )
8
= 53 sin ( u ) 1 = 5
3
x = 1 fi u = 1 = 1 :: x = 2 fi u = 2 = 8
3 3
b b b
Integration by Parts : Ú u dv = uv - Ú v du and Ú a u dv = uv a
- Ú v du . Choose u and dv from
a
Ú xe
-x 5
Ex. dx Ex. Ú3 ln x dx
u=x dv = e- x fi du = dx v = -e - x u = ln x dv = dx fi du = 1x dx v = x
Ú xe dx = - xe + Ú e dx = - xe - e
-x -x -x -x -x
+c
ln x dx = x ln x 3 - Ú dx = ( x ln ( x ) - x )
5 5 5 5
Ú3 3 3
= 5ln ( 5) - 3ln ( 3) - 2
Ú tan sin5 x
Ú cos x dx
3
Ex. x sec5 x dx Ex. 3
u 3 u 3
= 17 sec7 x - 15 sec5 x + c
= 12 sec2 x + 2 ln cos x - 12 cos 2 x + c
Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
104
Calculus Cheat Sheet
Trig Substitutions : If the integral contains the following root use the given substitution and
formula to convert into an integral involving trig functions.
a 2 - b 2 x 2 fi x = ab sin q b 2 x 2 - a 2 fi x = ba sec q a 2 + b 2 x 2 fi x = ab tan q
cos 2 q = 1 - sin 2 q tan 2 q = sec 2 q - 1 sec2 q = 1 + tan 2 q
Úx Û ( 23 cos q ) dq = Ú sin122 q dq
16
Ex. dx 16
2
4 -9 x 2 ı 4 sin 2 q ( 2cosq )
9
x = 23 sin q fi dx = 23 cos q dq
= Ú 12 csc 2 dq = -12 cot q + c
4 - 9x 2 = 4 - 4sin q = 4 cos q = 2 cos q
2 2
Use Right Triangle Trig to go back to x’s. From
Recall x 2 = x . Because we have an indefinite substitution we have sin q = 32x so,
integral we’ll assume positive and drop absolute
value bars. If we had a definite integral we’d
need to compute q ’s and remove absolute value
bars based on that and,
Ï x if x ≥ 0 4 -9 x 2
x =Ì From this we see that cot q = 3x
. So,
Ó- x if x < 0
4 -9 x 2
Úx dx = - 4 +c
16
In this case we have 4 - 9x = 2 cos q .
2 2
4 -9 x 2 x
P( x )
Partial Fractions : If integrating Ú Q( x) dx where the degree of P ( x ) is smaller than the degree of
Q ( x ) . Factor denominator as completely as possible and find the partial fraction decomposition of
the rational expression. Integrate the partial fraction decomposition (P.F.D.). For each factor in the
denominator we get term(s) in the decomposition according to the following table.
Ax + B A1 x + B1 Ak x + Bk
( ax + bx + c ) +L +
2 k
ax 2 + bx + c ax + bx + c ( ax 2 + bx + c )
2 k
ax + bx + c
2
7 x2 +13 x +C A( x2 + 4) + ( Bx + C ) ( x -1)
Ex. Ú ( x -1)( x 2
+4)
dx 7 x2 +13 x
2
( x -1)( x + 4 )
= A
x -1 + Bx
x2 + 4
= ( x -1)( x 2 + 4 )
An alternate method that sometimes works to find constants. Start with setting numerators equal in
previous example : 7 x 2 + 13x = A ( x 2 + 4 ) + ( Bx + C ) ( x - 1) . Chose nice values of x and plug in.
For example if x = 1 we get 20 = 5A which gives A = 4 . This won’t always work easily.
Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
105
Calculus Cheat Sheet
Applications of Integrals
b
Net Area : Ú a f ( x ) dx represents the net area between f ( x ) and the
x-axis with area above x-axis positive and area below x-axis negative.
Area Between Curves : The general formulas for the two main cases for each are,
b d
y = f ( x) fi A = Ú Èupper function ˘
Î ˚ - ÈÎlower function ˘˚ dx & x = f ( y) fi A = Ú È right function ˘
Î ˚ - ÈÎleft function ˘˚ dy
a c
If the curves intersect then the area of each portion must be found individually. Here are some
sketches of a couple possible situations and formulas for a couple of possible cases.
d
A = Ú f ( y ) - g ( y ) dy
b c b
A = Ú f ( x ) - g ( x ) dx c A = Ú f ( x ) - g ( x ) dx + Ú g ( x ) - f ( x ) dx
a a c
Ex. Axis : y = a > 0 Ex. Axis : y = a £ 0 Ex. Axis : y = a > 0 Ex. Axis : y = a £ 0
These are only a few cases for horizontal axis of rotation. If axis of rotation is the x-axis use the
y = a £ 0 case with a = 0 . For vertical axis of rotation ( x = a > 0 and x = a £ 0 ) interchange x and
y to get appropriate formulas.
Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
106
Calculus Cheat Sheet
Work : If a force of F ( x ) moves an object Average Function Value : The average value
b
b of f ( x ) on a £ x £ b is f avg = 1
Ú f ( x ) dx
in a £ x £ b , the work done is W = Ú F ( x ) dx b-a a
a
Arc Length Surface Area : Note that this is often a Calc II topic. The three basic formulas are,
b b b
L = Ú ds SA = Ú 2p y ds (rotate about x-axis) SA = Ú 2p x ds (rotate about y-axis)
a a a
where ds is dependent upon the form of the function being worked with as follows.
( ) ( dxdt ) ( )
2 2
dx if y = f ( x ) , a £ x £ b dt if x = f ( t ) , y = g ( t ) , a £ t £ b
dy 2 dy
ds = 1 + dx
ds = + dt
1+ ( ) ds = r 2 + ( ddrq ) dq if r = f (q ) , a £ q £ b
2 2
ds = dx
dy
dy if x = f ( y ) , a £ y £ b
With surface area you may have to substitute in for the x or y depending on your choice of ds to
match the differential in the ds. With parametric and polar you will always need to substitute.
Improper Integral
An improper integral is an integral with one or more infinite limits and/or discontinuous integrands.
Integral is called convergent if the limit exists and has a finite value and divergent if the limit
doesn’t exist or has infinite value. This is typically a Calc II topic.
Infinite Limit
• t b b
1. Ú f ( x ) dx = lim Ú f ( x ) dx 2. Ú• f ( x ) dx = lim Ú f ( x ) dx
a t Æ• a - t Æ-• t
• c •
3. Ú • f ( x ) dx = Ú • f ( x ) dx + Ú
- - c
f ( x ) dx provided BOTH integrals are convergent.
Discontinuous Integrand
b b b t
1. Discont. at a: Ú f ( x ) dx = lim+ Ú f ( x ) dx 2. Discont. at b : Ú f ( x ) dx = lim- Ú f ( x ) dx
a t Æa t a t Æb a
b c b
3. Discontinuity at a < c < b : Ú f ( x ) dx = Ú f ( x ) dx + Ú f ( x ) dx provided both are convergent.
a a c
b Dx
Trapezoid Rule : Ú f ( x ) dx ª 2 ÈÎ f ( x ) + 2 f ( x ) + +2 f ( x ) + L + 2 f ( x ) + f ( x )˘˚
a
0 1 2 n -1 n
b Dx
Simpson’s Rule : Ú f ( x ) dx ª 3 ÈÎ f ( x ) + 4 f ( x ) + 2 f ( x ) + L + 2 f ( x ) + 4 f ( x ) + f ( x )˘˚
a
0 1 2 n-2 n -1 n
Visit http://tutorial.math.lamar.edu for a complete set of Calculus notes. © 2005 Paul Dawkins
107
Regression (Machine Learning)
R Cheat Sheet
Summary:
Common Applications in Business: Used to predict a numeric value Parsnip (Machine Learning):
(e.g. forecasting sales, estimating prices, etc).
Key Concept: Data is usually in a rectangular format (like a spreadsheet) Model List (start here first)
with one column that is a target (e.g. price) and other columns that are Linear Regression & GLM
predictors (e.g. product category) Decision Tree
Gotchas: Random Forest
Preprocessing: Knowing when to preprocess data (normalize) Boosted Trees (XGBoost)
prior to machine learning step
SVM: Poly & Radial
Feature Engineering: Getting good features is more important
than applying complex models.
Keras (Deep Learning)
Parameter Tuning: Higher complexity models have many parameters
that can be tuned.
Interpretability: Some models are more explainable than others, H2O (ML & DL Framework)
meaning the estimates for each feature means something in relation to
the target. Other models are not interpretable and require additional tools MLR (ML Framework)
(e.g. LIME) to explain.
Linear Regression with Multiple Predictors
Each predictor (term) is interpretable meaning the value (estimate)
indicates an increase/decrease in the target
Terminology:
Supervised vs Unsupervised: Regression is a supervised technique
that requires training with a "target" (e.g. price of product or sales by Python Cheat
month). The algorithm learns by identifying relationships between the Sheet
target & the predictors (attributes related to the target like category of
product or month of sales).
Classification vs Regression: Classification aims to predict classes
(either binary yes/no or multi-class categorical). Regression aims to Scikit-Learn (Machine Learning):
predict a numeric value (e.g. product price = $4,233).
Preprocessing: Many algorithms require preprocessing, which Linear Regression
transforms the data into a format more suitable for the machine learning GLM (Elastic Net)
algorithm. A common example is "standardization" or scaling the feature Decision Tree (Regressor)
to be in a range of [0,1] (or close to it). Random Forest (Regressor)
Hyper Parameter & Tuning: Machine learning algorithms have many
AdaBoost (Regressor) ,
parameters that can be adjusted (e.g. learning rate in GBM). Tuning is the
XGBoost
process of systematically finding the optimum parameter values.
Cross Validation: Machine learning algorithms should be tuned on a SVM (Regressor)
validation set as opposed to a test set. Cross-validation is the process of
splitting the training set into multiple sets using a portion of the training Keras (Deep Learning)
set for tuning.
Decision Tree (Regression) Performance Metrics (Regression): Common performance metrics are H2O (ML & DL Framework)
Decisions are based on binary rules that split the nodes and terminate Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
at leaves. Regression trees estimate the value at each node. These measures provide an estimate of model performance to compare
models to each other.
Resources
Business Analysis With R Course (DS4B 101-R) - Data Science Courses
Modeling - Week 6 for Business
Business Science Problem Framework
Ultimate R Cheat Sheet | Ultimate Python Cheat Sheet
Business Science University
university.businessscience.io
108
version: 1.1
Machine Learning Algorithms Regression
Key Attributes Table
Business Science University
Data Science Courses for Business
university.businessscience.io
109
CS 229 – Machine Learning https://stanford.edu/~shervine
September 9, 2018
Regression Classifier
r Gradient descent – By noting – œ R the learning rate, the update rule for gradient descent
Outcome Continuous Class is expressed with the learning rate and the cost function J as follows:
Examples Linear regression Logistic regression, SVM, Naive Bayes ◊ Ω≠ ◊ ≠ –ÒJ(◊)
r Type of model – The different models are summed up in the table below:
Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training
example, and batch gradient descent is on a batch of training examples.
Illustration
r Likelihood – The likelihood of a model L(◊) given parameters ◊ is used to find the optimal
parameters ◊ through maximizing the likelihood. In practice, we use the log-likelihood ¸(◊) =
log(L(◊)) which is easier to optimize. We have:
◊opt = arg max L(◊)
Examples Regressions, SVMs GDA, Naive Bayes ◊
r Newton’s algorithm – The Newton’s algorithm is a numerical method that finds ◊ such
Notations and general concepts that ¸Õ (◊) = 0. Its update rule is as follows:
¸Õ (◊)
r Hypothesis – The hypothesis is noted h◊ and is the model that we choose. For a given input ◊Ω◊≠
¸ÕÕ (◊)
data x(i) , the model prediction output is h◊ (x(i) ).
Remark: the multidimensional generalization, also known as the Newton-Raphson method, has
r Loss function – A loss function is a function L : (z,y) œ R ◊ Y ‘≠æ L(z,y) œ R that takes as the following update rule:
inputs the predicted value z corresponding to the real data value y and outputs how different
they are. The common loss functions are summed up in the table below: ◊ Ω ◊ ≠ Ò2◊ ¸(◊) Ò◊ ¸(◊)
! "≠1
We assume here that y|x; ◊ ≥ N (µ,‡ 2 ) r Exponential family – A class of distributions is said to be in the exponential family if it can
be written in terms of a natural parameter, also called the canonical parameter or link function,
r Normal equations – By noting X the matrix design, the value of ◊ that minimizes the cost
÷, a sufficient statistic T (y) and a log-partition function a(÷) as follows:
function is a closed-form solution such that:
Remark: we will often have T (y) = y. Also, exp(≠a(÷)) can be seen as a normalization param-
r LMS algorithm – By noting – the learning rate, the update rule of the Least Mean Squares eter that will make sure that the probabilities sum to one.
(LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff
learning rule, is as follows: Here are the most common exponential distributions summed up in the following table:
m
(i)
’j, ◊j Ω ◊j + – y (i) ≠ h◊ (x(i) ) xj
i=1
Distribution ÷ T (y) a(÷) b(y)
ÿ # $
„
Bernoulli log 1≠„
y log(1 + exp(÷)) 1
Remark: the update rule is a particular case of the gradient ascent.
2
÷2
! "
r LWR – Locally Weighted Regression, also known as LWR, is a variant of linear regression that Gaussian µ y 2
Ô1
2fi
exp ≠ y2
weights each training example in its cost function by w(i) (x), which is defined with parameter
1 2
· œ R as: 1
Poisson log(⁄) y e÷
y!
(x(i) ≠ x)2
w(i) (x) = exp e÷
Geometric y log 1
≠
2· 2 log(1 ≠ „) 1≠e÷
3 4
! "
Classification and logistic regression r Assumptions of GLMs – Generalized Linear Models (GLM) aim at predicting a random
variable y as a function fo x œ Rn+1 and rely on the following 3 assumptions:
r Sigmoid function – The sigmoid function g, also known as the logistic function, is defined
as follows:
(1) y|x; ◊ ≥ ExpFamily(÷) (2) h◊ (x) = E[y|x; ◊] (3) ÷ = ◊T x
1
’z œ R, g(z) = œ]0,1[
1 + e≠z
Remark: ordinary least squares and logistic regression are special cases of generalized linear
models.
r Logistic regression – We assume here that y|x; ◊ ≥ Bernoulli(„). We have the following
form:
exp(◊iT x)
„i =
K
where (w, b) œ Rn ◊ R is the solution of the following optimization problem:
exp(◊jT x) 1
j=1
min ||w||2 such that y (i) (wT x(i) ≠ b) > 1
2
ÿ
y ≥ Bernoulli(„)
r Estimation – The following table sums up the estimates that we find when maximizing the
likelihood:
„
Remark: the line is defined as wT x ≠ b = 0 . m m
1 1
i=1 {y (i) =j}
x(i) 1
‚ ‚
m
µ‚j (j = 0,1)
r Hinge loss – The hinge loss is used in the setting of SVMs and is defined as follows:
(x(i) ≠ µy(i) )(x(i) ≠ µy(i) )T
m m
1{y(i) =1}
i=1 i=1 i=1
qm
1{y(i) =j}
ÿ ÿ
r Kernel – Given a feature mapping „, we define the kernel K to be defined as: Naive Bayes
K(x,z) = „(x)T „(z)
r Assumption – The Naive Bayes model supposes that the features of each data point are all
independent:
||x≠z||2
In practice, the kernel K defined by K(x,z) = exp is called the Gaussian kernel
n
≠ 2‡2
and is commonly used.
1 2
P (x|y) = P (x1 ,x2 ,...|y) = P (x1 |y)P (x2 |y)... = P (xi |y)
i=1
Ÿ
r Solutions – Maximizing the log-likelihood gives the following solutions, with k œ {0,1},
l œ [[1,L]]
(j)
1 #{j|y (j) = k and xi = l}
P (y = k) = ◊ #{j|y (j) = k} and P (xi = l|y = k) =
m #{j|y (j) = k}
Remark: we say that we use the "kernel trick" to compute the cost function using the kernel Remark: Naive Bayes is widely used for text classification and spam detection.
because we actually don’t need to know the explicit mapping „, which is often very complicated.
Instead, only the values K(x,z) are needed.
Tree-based and ensemble methods
r Lagrangian – We define the Lagrangian L(w,b) as follows:
l These methods can be used for both regression and classification problems.
L(w,b) = f (w) + —i hi (w) r CART – Classification and Regression Trees (CART), commonly known as decision trees,
i=1
can be represented as binary trees. They have the advantage to be very interpretable.
ÿ
Remark: the coefficients —i are called the Lagrange multipliers. r Random forest – It is a tree-based technique that uses a high number of decision trees
built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly
uninterpretable but its generally good performance makes it a popular algorithm.
Generative Learning Remark: random forests are a type of ensemble methods.
A generative model first tries to learn how the data is generated by estimating P (x|y), which r Boosting – The idea of boosting methods is to combine several weak learners to form a
we can then use to estimate P (y|x) by using Bayes’ rule. stronger one. The main ones are summed up in the table below:
r k-nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a • the training and testing sets follow the same distribution
non-parametric approach where the response of a data point is determined by the nature of its
k neighbors from the training set. It can be used in both classification and regression settings. • the training examples are drawn independently
Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the
higher the variance.
r Shattering – Given a set S = {x(1) ,...,x(d) }, and a set of classifiers H, we say that H shatters
S if for any set of labels {y (1) , ..., y (d) }, we have:
r Upper bound theorem – Let H be a finite hypothesis class such that |H| = k and let ” and
the sample size m be fixed. Then, with probability of at least 1 ≠ ”, we have:
1 2k
‘(h) 6 min ‘(h) + 2 log
hœH 2m ”
1 2 Ú 1 2
‚
d m 1 1
h) 6 min ‘(h) + O log + log
hœH m d m ”
1 2 3Ú 1 2 1 24
‘(‚
r Hoeffding inequality – Let Z1 , .., Zm be m iid variables drawn from a Bernoulli distribution
of parameter „. Let „
‚ be their sample mean and “ > 0 fixed. We have:
P (|„ ≠ „
Summary: K-Means
set.seed(0)
Common Applications in Business: Can be used for finding
segments within Customers, Companies, etc. kmeans_obj < kmeans(X, centers = 4)
Key Concept: Transform data into a matrix enabling trends to
be compared across units of measure (e.g. user-item matrix)
Gotchas: Data must be normalized or standardized to enable UMAP
comparison. This often requires calculation proportions of
values by customer, company, etc to ensure the larger values library(umap)
do not dominate the trend mining operation.
How Many Components/Clusters? Use a Scree Plot to umap_obj < umap(X)
determine the proportion of variance explained or total within
sum of squares
Python Cheat
Sheet
K-Means
from sklearn.cluster import KMeans
kmeans = KMeans(
n_clusters=4,
random_state=0).fit(X)
UMAP
import umap
reducer = umap.UMAP()
embedding = reducer.fit_transform(X)
Resources
Business Analysis With R Course (DS4B 101-R) - Data Science Courses
Modeling - Week 6 for Business
Business Science Problem Framework
Ultimate R Cheat Sheet | Ultimate Python Cheat Sheet
Business Science University
university.businessscience.io
114
version: 1.0
Segmentation & Clustering
How to apply KMeans & UMAP stepbystep
Clustering Workflow
Collect Data Standardize / Normalize Spread to User-Item Format
K-Means
Obtain cluster assignments
UMAP
Make 2D Projection
Business Science University
Data Science Courses for Business
university.businessscience.io
115
CS 229 – Machine Learning https://stanford.edu/~shervine
September 9, 2018
r Latent variables – Latent variables are hidden/unobserved variables that make estimation
1{c(i) =j}
problems difficult, and are often denoted z. Here are the most common settings where there are i=1
ÿ
latent variables:
• M-step: Use the posterior probabilities Qi (z (i) ) as cluster specific weights on data points
x(i) to separately re-estimate each cluster model as follows: Hierarchical clustering
Ward linkage Average linkage Complete linkage • Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.
Minimize within cluster Minimize average distance Minimize maximum distance
distance between cluster pairs of between cluster pairs (i) m m
(i)
xj ≠ µj 1 (i) 1 (i)
xj Ω where µj = xj and ‡j2 = (xj ≠ µj )2
‡j m m
i=1 i=1
ÿ ÿ
other points in the same class, and between a sample and all other points in the next nearest
cluster, the silhouette coefficient s for a single sample is defined as follows: • Step 3: Compute u1 , ..., uk œ Rn the k orthogonal principal eigenvectors of , i.e. the
orthogonal eigenvectors of the k largest eigenvalues.
b≠a
s= • Step 4: Project the data on spanR (u1 ,...,uk ). This procedure maximizes the variance
max(a,b)
among all k-dimensional spaces.
k m
Bk = nc(i) (µc(i) ≠ µ)(µc(i) ≠ µ)T , Wk = (x(i) ≠ µc(i) )(x(i) ≠ µc(i) )T
j=1 i=1
ÿ ÿ
the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such
that the higher the score, the more dense and well separated the clusters are. It is defined as
follows:
Tr(Bk )
s(k) = ◊
N ≠k
Independent component analysis
Tr(Wk ) k≠1
It is a technique meant to find the underlying generating sources.
r Assumptions – We assume that our data x has been generated by the n-dimensional source
Principal component analysis vector s = (s1 ,...,sn ), where si are independent random variables, via a mixing and non-singular
matrix A as follows:
It is a dimension reduction technique that finds the variance maximizing directions onto which x = As
to project the data.
The goal is to find the unmixing matrix W = A≠1 by an update rule.
r Eigenvalue, eigenvector – Given a matrix A œ Rn◊n , ⁄ is said to be an eigenvalue of A if
there exists a vector z œ Rn \{0}, called eigenvector, such that we have: r Bell and Sejnowski ICA algorithm – This algorithm finds the unmixing matrix W by
following the steps below:
Az = ⁄z
• Write the probability of x = As = W ≠1 s as:
÷ diagonal, A = U UT
• Write the log likelihood given our training data {x(i) , i œ [[1,m]]} and by noting g the
Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of sigmoid function as:
matrix A.
m
r Algorithm – The Principal Component Analysis (PCA) procedure is a dimension reduction
technique that projects the data on k dimensions by maximizing the variance of the data as l(W ) = log g Õ
(wjT x(i) ) + log |W |
follows:
A n B
i=1 j=1
ÿ ÿ 1 2
Therefore, the stochastic gradient ascent learning rule is such that for each training example
x(i) , we update W as follows:
1 ≠ 2g(w1T x(i) )
T (i)
.
QQ R R
..
T x(i) )
cc1 ≠ 2g(w2 x )d x(i) T + (W T )≠1 d
1 ≠ 2g(wn
W Ω≠ W + – aa b b
VIP Cheatsheet: Machine Learning Tips r ROC – The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by
varying the threshold. These metrics are are summed up in the table below:
Given a set of data points {x(1) , ..., x(m) }, where each x(i) has n features, associated to a set of
outcomes {y (1) , ..., y (m) }, we want to assess a given classifier that learns how to predict y from r AUC – The area under the receiving operating curve, also noted AUC or AUROC, is the
x. area below the ROC as shown in the following figure:
Classification
In a context of a binary classification, here are the main metrics that are important to track to
assess the performance of the model.
r Confusion matrix – The confusion matrix is used to have a more complete picture when
assessing the performance of a model. It is defined as follows:
Predicted class
+ –
TP FN
+ False Negatives Regression
True Positives
Type II error r Basic metrics – Given a regression model f , the following metrics are commonly used to
Actual class assess the performance of the model:
FP TN
– False Positives Total sum of squares Explained sum of squares Residual sum of squares
True Negatives
Type I error m m m
SStot = (yi ≠ y)2 SSreg = (f (xi ) ≠ y)2 SSres = (yi ≠ f (xi ))2
r Main metrics – The following metrics are commonly used to assess the performance of
i=1 i=1 i=1
classification models:
ÿ ÿ ÿ
‡ 2 is an estimate of the variance associated with each response. LASSO Ridge Elastic Net
- Shrinks coefficients to 0 Makes coefficients smaller Tradeoff between variable
- Good for variable selection selection and small coefficients
where L is the likelihood and ‚
Model selection
r Vocabulary – When selecting a model, we distinguish 3 different parts of the data that we
have as follows:
Once the model has been chosen, it is trained on the entire dataset and tested on the unseen ... + ⁄||◊||1 ... + ⁄||◊||22 ... + ⁄ (1 ≠ –)||◊||1 + –||◊||22
test set. These are represented in the figure below:
Ë È
r Model selection – Train model on training set, then evaluate on the development set, then
pick best performance model on the development set, and retrain all of that model on the whole
training set.
r Cross-validation – Cross-validation, also noted CV, is a method that is used to select a
model that does not rely too much on the initial training set. The different types are summed
up in the table below:
Diagnostics
k-fold Leave-p-out
r Bias – The bias of a model is the difference between the expected prediction and the correct
- Training on k ≠ 1 folds and - Training on n ≠ p observations and model that we try to predict for given data points.
assessment on the remaining one assessment on the p remaining ones
r Variance – The variance of a model is the variability of the model prediction for given data
- Generally k = 5 or 10 - Case p = 1 is called leave-one-out points.
r Bias/variance tradeoff – The simpler the model, the higher the bias, and the more complex
The most commonly used method is called k-fold cross-validation and splits the training data the model, the higher the variance.
into k folds to validate the model on one fold while training the model on the k ≠ 1 other folds,
all of this k times. The error is then averaged over the k folds and is named cross-validation
error.
Underfitting Just right Overfitting
- High training error - Training error - Low training error
Symptoms - Training error close slightly lower than - Training error much
to test error test error lower than test error
- High bias - High variance
Regression
r Regularization – The regularization procedure aims at avoiding the model to overfit the
data and thus deals with high variance issues. The following table sums up the different types
of commonly used regularization techniques:
Classification
Deep learning
r Error analysis – Error analysis is analyzing the root cause of the difference in performance
between the current and the perfect models.
r Ablative analysis – Ablative analysis is analyzing the root cause of the difference in perfor-
mance between the current and the baseline models.
VIP Cheatsheet: Deep Learning r Learning rate – The learning rate, often noted ÷, indicates at which pace the weights get
updated. This can be fixed or adaptively changed. The current most popular method is called
Adam, which is a method that adapts the learning rate.
Neural networks are a class of models that are built with layers. Commonly used types of neural ˆL(z,y)
networks include convolutional and recurrent neural networks. w Ω≠ w ≠ ÷
ˆw
r Architecture – The vocabulary around neural networks architectures is described in the
figure below:
r Updating weights – In a neural network, weights are updated as follows:
where we note w, b, z the weight, bias and output respectively. r Dropout – Dropout is a technique meant at preventing overfitting the training data by
dropping out units in a neural network. In practice, neurons are either dropped with probability
r Activation function – Activation functions are used at the end of a hidden unit to introduce p or kept with probability 1 ≠ p.
non-linear complexities to the model. Here are the most common ones:
W ≠ F + 2P
N = +1
S
xi ≠ µ B
+—
r Cross-entropy loss – In the context of neural networks, the cross-entropy loss L(z,y) is 2 +‘
‡B
commonly used and is defined as follows:
xi Ω≠ “
L(z,y) = ≠ y log(z) + (1 ≠ y) log(1 ≠ z) It is usually done after a fully connected/convolutional layer and before a non-linearity layer and
aims at allowing higher learning rates and reducing the strong dependence on initialization.
Ë È
r Types of gates – Here are the different types of gates that we encounter in a typical recurrent V0 (s) = 0
neural network:
• We iterate the value based on the values before:
Input gate Forget gate Output gate Gate
Write to cell or not? Erase a cell or not? Reveal a cell or not? How much writing? Õ Õ
Vi+1 (s) = R(s) + max “Psa (s )Vi (s )
aœA
C D
sÕ œS
ÿ
r LSTM – A long short-term memory (LSTM) network is a type of RNN model that avoids
the vanishing gradient problem by adding ’forget’ gates.
r Maximum likelihood estimate – The maximum likelihood estimates for the state transition
probabilities are as follows:
Reinforcement Learning and Control
#times took action a in state s and got to sÕ
Psa (sÕ ) =
The goal of reinforcement learning is for an agent to learn how to evolve in an environment. #times took action a in state s
r Markov decision processes – A Markov decision process (MDP) is a 5-tuple (S,A,{Psa },“,R)
where: r Q-learning – Q-learning is a model-free estimation of Q, which is done as follows:
r Value function – For a given policy fi and a given state s, we define the value function V fi
as follows:
ú
r Bellman equation – The optimal Bellman equations characterizes the value function V fi
of the optimal policy fi ú :
ú ú
V fi (s) = R(s) + max “ Psa (sÕ )V fi (sÕ )
aœA
sÕ œS
ÿ
Remark: we note that the optimal policy fi ú for a given state s is such that:
Advantages Drawbacks
VIP Cheatsheet: Recurrent Neural Networks - Possibility of processing input of any length - Computation being slow
- Model size not increasing with size of input - Difficulty of accessing information
- Computation takes into account from a long time ago
Afshine Amidi and Shervine Amidi historical information - Cannot consider any future input
- Weights are shared across time for the current state
One-to-many
Music generation
Tx = 1, Ty > 1
For each timestep t, the activation a<t> and the output y <t> are expressed as follows:
Many-to-one
a<t> = g1 (Waa a<t≠1> + Wax x<t> + ba ) and y <t> = g2 (Wya a<t> + by ) Sentiment classification
Tx > 1, Ty = 1
where Wax , Waa , Wya , ba , by are coefficients that are shared temporally and g1 , g2 activation
functions
Many-to-many
Name entity recognition
Tx = Ty
Many-to-many
Machine translation
Tx ”= Ty
The pros and cons of a typical RNN architecture are summed up in the table below: r Loss function – In the case of a recurrent neural network, the loss function L of all time
steps is defined based on the loss at every time step as follows: r Types of gates – In order to remedy the vanishing gradient problem, specific gates are used
in some types of RNNs and usually have a well-defined purpose. They are usually noted and
Ty are equal to:
y ,y) = y <t> ,y <t> )
t=1
= ‡(W x<t> + U a<t≠1> + b)
ÿ
L(‚ L(‚
Update gate u How much past should matter now? GRU, LSTM
Relevance gate r Drop previous information? GRU, LSTM
Handling long term dependencies
Forget gate f Erase a cell or not? LSTM
r Commonly used activation functions – The most common activation functions used in
RNN modules are described below: Output gate o How much to reveal of a cell? LSTM
r Vanishing/exploding gradient – The vanishing and exploding gradient phenomena are a<t> c<t> o ı c<t>
often encountered in the context of RNNs. The reason why they happen is that it is difficult
to capture long term dependencies because of multiplicative gradient that can be exponentially
decreasing/increasing with respect to the number of layers.
r Gradient clipping – It is a technique used to cope with the exploding gradient problem
sometimes encountered when performing backpropagation. By capping the maximum value for Dependencies
the gradient, this phenomenon is controlled in practice.
Remark: the sign ı denotes the element-wise multiplication between two vectors.
r Variants of RNNs – The table below sums up the other commonly used RNN architectures:
Bidirectional Deep
(BRNN) (DRNN)
r Skip-gram – The skip-gram word2vec model is a supervised learning task that learns word
embeddings by assessing the likelihood of any given target word t happening with a context
word c. By noting ◊t a parameter associated with t, the probability P (t|c) is given by:
exp(◊tT ec )
P (t|c) =
|V |
exp(◊jT ec )
j=1
ÿ
P (y = 1|c,t) = ‡(◊tT ec )
Remark: this method is less computationally expensive than the skip-gram model.
r GloVe – The GloVe model, short for global vectors for word representation, is a word em-
bedding technique that uses a co-occurence matrix X where each Xi,j denotes the number of
times that a target i occurred with a context j. Its cost function J is as follows:
|V |
1
- Noted ow - Noted ew J(◊) = f (Xij )(◊iT ej + bi + bÕj ≠ log(Xij ))2
2
- Naive approach, no similarity information - Takes into account words similarity i,j=1
ÿ
r Word2vec – Word2vec is a framework aimed at learning word embeddings by estimating the Remark: the individual components of the learned word embeddings are not necessarily inter-
likelihood that a given word is surrounded by other words. Popular models include skip-gram, pretable.
negative sampling and CBOW.
Comparing words r Beam search – It is a heuristic search algorithm used in machine translation and speech
recognition to find the likeliest sentence y given an input x.
r Cosine similarity – The cosine similarity between words w1 and w2 is expressed as follows:
w1 · w2 • Step 1: Find top B likely words y <1>
similarity = = cos(◊)
||w1 || ||w2 ||
• Step 2: Compute conditional probabilities y <k> |x,y <1> ,...,y <k≠1>
Remark: ◊ is the angle between words w1 and w2 . • Step 3: Keep top B combinations x,y <1> ,...,y <k>
Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.
r Beam width – The beam width B is a parameter for beam search. Large values of B yield
to better result but with slower performance and increased memory. Small values of B lead to
worse results but is less computationally intensive. A standard value for B is around 10.
r Length normalization – In order to improve numerical stability, beam search is usually ap-
plied on the following normalized objective, often called the normalized log-likelihood objective,
defined as:
Ty
1
Language model Objective = log p(y <t> |x,y <1> , ..., y <t≠1> )
Ty–
t=1
ÿ Ë È
r Bleu score – The bilingual evaluation understudy (bleu) score quantifies how good a machine
Machine translation translation is by computing a similarity score based on n-gram precision. It is defined as follows:
r Overview – A machine translation model is similar to a language model except it has an n
encoder network placed before. For this reason, it is sometimes referred as a conditional language 1
bleu score = exp pk
model. The goal is to find a sentence y such that: n
A B
k=1
ÿ
countclip (n-gram)
n-gramœy
ÿ
pn =
count(n-gram)
‚
n-gramœy
ÿ
Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially
inflated bleu score.
‚
Attention
r Attention model – This model allows an RNN to pay attention to specific parts of the input
that is considered as being important, which improves the performance of the resulting model
Õ
in practice. By noting –<t,t > the amount of attention that the output y <t> should pay to the
activation a <tÕ
> and c <t> the context at time t, we have:
Õ
> <tÕ > Õ
>
c<t> = –<t,t a with –<t,t =1
tÕ tÕ
ÿ ÿ
Remark: the attention scores are commonly used in image captioning and machine translation.
r Attention weight – The amount of attention that the output y <t> should pay to the
Õ Õ
activation a<t > is given by –<t,t > computed as follows:
Õ
Õ
> exp(e<t,t >)
–<t,t =
Tx
ÕÕ
>
exp(e<t,t )
tÕÕ =1
ÿ
ı ı ı
r Architecture of a traditional CNN – Convolutional neural networks, also known as CNNs, r Fully Connected (FC) – The fully connected layer (FC) operates on a flattened input where
are a specific type of neural networks that are generally composed of the following layers: each input is connected to all neurons. If present, FC layers are usually found towards the end
of CNN architectures and can be used to optimize objectives such as class scores.
The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters
that are described in the next sections.
Filter hyperparameters
The convolution layer contains filters for which it is important to know the meaning behind its
Types of layer hyperparameters.
r Convolutional layer (CONV) – The convolution layer (CONV) uses filters that perform r Dimensions of a filter – A filter of size F ◊ F applied to an input containing C channels is
convolution operations as it is scanning the input I with respect to its dimensions. Its hyperpa- a F ◊ F ◊ C volume that performs convolutions on an input of size I ◊ I ◊ C and produces an
rameters include the filter size F and stride S. The resulting output O is called feature map or output feature map (also called activation map) of size O ◊ O ◊ 1.
activation map.
Remark: the application of K filters of size F ◊ F results in an output feature map of size
O ◊ O ◊ K.
Remark: the convolution step can be generalized to the 1D and 3D cases as well. r Stride – For a convolutional or a pooling operation, the stride S denotes the number of pixels
by which the window moves after each operation.
r Pooling (POOL) – The pooling layer (POOL) is a downsampling operation, typically applied
after a convolution layer, which does some spatial invariance. In particular, max and average
pooling are special kinds of pooling where the maximum and average value is taken, respectively.
r Zero-padding – Zero-padding denotes the process of adding P zeroes to each side of the CONV POOL FC
boundaries of the input. This value can either be manually specified or automatically set through
one of the three modes detailed below:
SÁ S
I Ë≠I+F ≠S
Input size Nin
Pend = 2
Pend = F ≠ 1 I ◊I ◊C I ◊I ◊C
Output size Nout
Ï Ì
O◊O◊K O◊O◊C
Number of
0
parameters
(F ◊ F ◊ C + 1) · K (Nin + 1) ◊ Nout
Illustration
- One bias parameter - Input is flattened
per filter - Pooling operation - One bias parameter
done channel-wise per neuron
Remarks - In most cases, S < F
- The number of FC
- Maximum padding - A common choice
- No padding - Padding such that feature - In most cases, S = F neurons is free of
such that end for K is 2C
I structural constraints
map size has size S convolutions are
- Drops last
Purpose applied on the limits
convolution if - Output size is
Ï Ì
of the input
dimensions do not mathematically convenient
match - Filter ’sees’ the input r Receptive field – The receptive field at layer k is the area denoted Rk ◊ Rk of the input
- Also called ’half’ padding that each pixel of the k-th activation map can ’see’. By calling Fj the filter size of layer j and
end-to-end
Si the stride value of layer i and with the convention S0 = 1, the receptive field at layer k can
be computed with the formula:
r Parameter compatibility in convolution layer – By noting I the length of the input Rk = 1 + (Fj ≠ 1) Si
volume size, F the length of the filter, P the amount of zero padding, S the stride, then the j=1 i=0
ÿ Ÿ
output size O of the feature map along that dimension is given by:
Remark: often times, Pstart = Pend , P , in which case we can replace Pstart + Pend by 2P in Commonly used activation functions
the formula above.
r Understanding the complexity of the model – In order to assess the complexity of a r Rectified Linear Unit – The rectified linear unit layer (ReLU) is an activation function g
model, it is often useful to determine the number of parameters that its architecture will have. that is used on all elements of the volume. It aims at introducing non-linearities to the network.
In a given layer of a convolutional neural network, it is done as follows: Its variants are summarized in the table below:
exj Bp fl Ba
j=1
IoU(Bp ,Ba ) =
ÿ
Bp fi Ba
Object detection
r Types of models – There are 3 main types of object recognition algorithms, for which the
nature of what is predicted is different. They are described in the table below:
Classification
Image classification Detection
w. localization
Remark: we always have IoU œ [0,1]. By convention, a predicted bounding box Bp is considered
as being reasonably good if IoU(Bp ,Ba ) > 0.5.
r Anchor boxes – Anchor boxing is a technique used to predict overlapping bounding boxes.
In practice, the network is allowed to predict more than one box simultaneously, where each box
- Classifies a picture - Detects object in a picture - Detects up to several objects prediction is constrained to have a given set of geometrical properties. For instance, the first
in a picture prediction can potentially be a rectangular box of a given form, while the second will be another
- Predicts probability of
rectangular box of a different geometrical form.
- Predicts probability object and where it is - Predicts probabilities of objects
of object located and where they are located r Non-max suppression – The non-max suppression technique aims at removing duplicate
overlapping bounding boxes of a same object by selecting the most representative ones. After
Traditional CNN Simplified YOLO, R-CNN YOLO, R-CNN having removed all boxes having a probability prediction lower than 0.6, the following steps are
repeated while there are boxes remaining:
r Detection – In the context of object detection, different methods are used depending on • Step 1: Pick the box with the largest prediction probability.
whether we just want to locate the object or detect a more complex shape in the image. The
two main ones are summed up in the table below: • Step 2: Discard any box having an IoU > 0.5 with the previous box.
r Types of models – Two main types of model are summed up in table below:
r YOLO – You Only Look Once (YOLO) is an object detection algorithm that performs the
following steps:
• Step 2: For each grid cell, run a CNN that predicts y of the following form:
where pc is the probability of detecting an object, bx ,by ,bh ,bw are the properties of the
detected bouding box, c1 ,...,cp is a one-hot representation of which of the p classes were r One Shot Learning – One Shot Learning is a face verification algorithm that uses a limited
detected, and k is the number of anchor boxes. training set to learn a similarity function that quantifies how different two given images are. The
similarity function applied to two images is often noted d(image 1, image 2).
• Step 3: Run the non-max suppression algorithm to remove any potential duplicate over-
lapping bounding boxes. r Siamese Network – Siamese Networks aim at learning how to encode images to then quantify
how different two images are. For a given input image x(i) , the encoded output is often noted
as f (x(i) ).
r Triplet loss – The triplet loss ¸ is a loss function computed on the embedding representation
of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive
example belong to a same class, while the negative example to another one. By calling – œ R+
the margin parameter, this loss is defined as follows:
Remark: when pc = 0, then the network does not detect any object. In that case, the corre-
sponding predictions bx , ..., cp have to be ignored.
r R-CNN – Region with Convolutional Neural Networks (R-CNN) is an object detection algo-
rithm that first segments the image to find potential relevant bounding boxes and then run the
detection algorithm to find most probable objects in those bounding boxes.
Remark: although the original algorithm is computationally expensive and slow, newer archi- r Motivation – The goal of neural style transfer is to generate an image G based on a given
tectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN. content C and a given style S.
1 [l](C) r ResNet – The Residual Network architecture (also called ResNet) uses residual blocks with a
Jcontent (C,G) = ||a ≠ a[l](G) ||2 high number of layers meant to decrease the training error. The residual block has the following
2 characterizing equation:
a[l+2] = g(a[l] + z [l+2] )
r Style matrix – The style matrix G[l] of a given layer l is a Gram matrix where each of its
[l]
elements GkkÕ quantifies how correlated the channels k and kÕ are. It is defined with respect to
r Inception Network – This architecture uses inception modules and aims at giving a try
activations a[l] as follows: at different convolutions in order to increase its performance. In particular, it uses the 1 ◊ 1
[l]
convolution trick to lower the burden of computation.
n [l]
nw
[l] [l] [l]
GkkÕ = aijk aijkÕ ı ı ı
i=1 j=1
ÿ
H ÿ
Remark: the style matrix for the style image and the generated image are noted G[l](S) and
G[l](G) respectively.
r Style cost function – The style cost function Jstyle (S,G) is used to determine how the
generated image G differs from the style S. It is defined as follows:
r Overall cost function – The overall cost function is defined as being a combination of the
content and style cost functions, weighted by parameters –,—, as follows:
Remark: a higher value of – will make the model care more about the content while a higher
value of — will make it care more about the style.
November 26, 2018 It is usually done after a fully connected/convolutional layer and before a non-linearity layer and
aims at allowing higher learning rates and reducing the strong dependence on initialization.
r Loss function – In order to quantify how a given model performs, the loss function L is
usually used to evaluate to what extent the actual outputs y are correctly predicted by the
model outputs z.
r Cross-entropy loss – In the context of binary classification in neural networks, the cross-
entropy loss L(z,y) is commonly used and is defined as follows:
- Random focus
L(z,y) = ≠ y log(z) + (1 ≠ y) log(1 ≠ z)
- Flipped with respect - Rotation with on one part of
Ë È
ˆL(z,y)
w Ω≠ w ≠ –
ˆw
- Nuances of RGB
- Addition of noise - Parts of image - Luminosity changes
is slightly changed
- More tolerance to ignored - Controls difference
- Captures noise r Updating weights – In a neural network, weights are updated as follows:
quality variation of - Mimics potential in exposition due
that can occur
inputs loss of parts of image to time of day
with light exposure • Step 1: Take a batch of training data and perform forward propagation to compute the
loss.
r Batch normalization – It is a step of hyperparameter “, — that normalizes the batch {xi }. • Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight.
2 the mean and variance of that we want to correct to the batch, it is done as
By noting µB , ‡B
follows: • Step 3: Use the gradients to update the weights of the network.
r Transfer learning – Training a deep learning model requires a lot of data and more impor-
tantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets Regularization
that took days/weeks to train, and leverage it towards our use case. Depending on how much
data we have at hand, here are the different ways to leverage this: r Dropout – Dropout is a technique used in neural networks to prevent overfitting the training
data by dropping out neurons with probability p > 0. It forces the model to avoid relying too
much on particular sets of features.
Remark: most deep learning frameworks parametrize dropout through the ’keep’ parameter 1≠p.
r Weight regularization – In order to make sure that the weights are not too large and that
Freezes most layers, the model is not overfitting the training set, regularization techniques are usually performed on
Medium trains weights on last the model weights. The main ones are summed up in the table below:
layers and softmax
r Learning rate – The learning rate, often noted – or sometimes ÷, indicates at which pace the
weights get updated. It can be fixed or adaptively changed. The current most popular method
is called Adam, which is a method that adapts the learning rate.
r Adaptive learning rates – Letting the learning rate vary when training a model can reduce
the training time and improve the numerical optimal solution. While Adam optimizer is the ... + ⁄||◊||1 ... + ⁄||◊||22 ... + ⁄ (1 ≠ –)||◊||1 + –||◊||22
most commonly used technique, others can also be useful. They are summed up in the table
Ë È
⁄œR
below:
⁄œR
⁄ œ R,– œ [0,1]
r Early stopping – This regularization technique stops the training process as soon as the
validation loss reaches a plateau or starts to increase.
Good practices
r Overfitting small batch – When debugging a model, it is often useful to make quick tests
to see if there is any major issue with the architecture of the model itself. In particular, in order
to make sure that the model can be properly trained, a mini-batch is passed inside the network
to see if it can overfit on it. If it cannot, it means that the model is either too complex or not
complex enough to even overfit on a small batch, let alone a normal-sized training set.
r Gradient checking – Gradient checking is a method used during the implementation of
the backward pass of a neural network. It compares the value of the analytical gradient to the
numerical gradient at given points and plays the role of a sanity-check for correctness.
ı ı ı
CONTINUED ON BACK →
DASK COLLECTIONS (CONTINUED) 141
ADVANCED
Read from distributed file systems or df = dd.read_parquet('s3://bucket/myfile.parquet')
cloud storage
Prepend prefixes like hdfs://, s3://, b = db.read_text('hdfs:///path/to/my-data.*.json')
or gcs:// to paths
Persist lazy computations in memory df = df.persist()
Compute multiple outputs at once dask.compute(x.min(), x.max())
CUSTOM COMPUTATIONS FOR CUSTOM CODE AND COMPLEX ALGORITHMS
DASK DELAYED LAZY PARALLELISM FOR CUSTOM CODE
Import import dask
Wrap custom functions with the @dask.delayed
@dask.delayed annotation def load(filename):
...
Delayed functions operate lazily, @dask.delayed
producing a task graph rather than def process(data):
executing immediately ...
Passing delayed results to other load = dask.delayed(load)
delayed functions creates process = dask.delayed(process)
dependencies between tasks
Call functions in normal code data = [load(fn) for fn in filenames]
results = [process(d) for d in data]
Compute results to execute in parallel dask.compute(results)
CONCURRENT.FUTURES ASYNCHRONOUS REAL-TIME PARALLELISM
Import from dask.distributed import Client
Start local Dask Client client = Client()
Submit individual task asynchronously future = client.submit(func, *args, **kwargs)
Block and gather individual result result = future.result()
Process results as they arrive for future in as_completed(futures):
...
EXAMPLE L = [client.submit(read, fn) for fn in filenames]
L = [client.submit(process, future) for future in L]
future = client.submit(sum, L)
result = future.result()
SET UP CLUSTER HOW TO LAUNCH ON A CLUSTER
MANUALLY
Start scheduler on one machine $ dask-scheduler
Scheduler started at SCHEDULER_ADDRESS:8786
Start workers on other machines host1$ dask-worker SCHEDULER_ADDRESS:8786
Provide address of the running scheduler host2$ dask-worker SCHEDULER_ADDRESS:8786
Start Client from Python process from dask.distributed import Client
client = Client('SCHEDULER_ADDRESS:8786')
ON A SINGLE MACHINE
Call Client() with no arguments for easy client = Client()
setup on a single host
CLOUD DEPLOYMENT
See dask-kubernetes project for Google Cloud pip install dask-kubernetes
See dask-ec2 project for Amazon EC2 pip install dask-ec2
MORE RESOURCES
User Documentation dask.pydata.org
Technical documentation for distributed scheduler distributed.readthedocs.org
Report a bug github.com/dask/dask/issues
('b', 2) DataCamp
>>> textFile2 = sc.wholeTextFiles("/my/directory/") ('a', 2) Learn Python for Data Science Interactively
NLP Cheat Sheet Context-free grammar HMMs and the Viterbi decoder
A context-free grammar consists of: HMM is a Weighted Finite-state Automaton (WFSA)
Bag of Words Methods • a set of non-terminal symbols A, B, C, · · · 2 N • Each transition arc is associated with a probability
Information Retrieval • a set of terminal symbols a, b, c, · · · 2 T • The sum of all arcs outgoing from a single node is 1
Cosine similarity metric • a start symbol S 2 N Viterbi decoder’s time complexity is O(s2 n) and the space
complexity is O(sn) for s states and n words.
Define a similarity metric between topic vectors. It is the • a set of productions P of the form N ! (N U T )⇤
cosine of the angle between the term vectors. viterbi[s, t] = maxs0 (viterbi[s0 , t 1] ⇥ p(s, s0 ) ⇥ P (token[t]|s))
CFG can be used to generate sentences, recognize sentences or for each s, t: record which s, t 1 contributed the maximum.
ai ⇥ b i parse sentences.
2 2 Chunkers and name taggers
P
i ai ⇥ i bi Parsing
BIO (IOB) tags
sim(A, B) = qP i qP
Unusual words determine the topic much more than common The CKY algorithm
words. Weight each term frequency tfi by its inverse I ! inside a baseNP
document frequency idf . The grammar has to be in Chomsky normal form (CNF). O ! outside a baseNP
Every production must have the form A ! BC or A ! a. B ! start of a baseNP which follows another baseNP
N
idfi = log( )
ni • Use a parse table P T [i, j]
Maximum entropy
where N =size of collection and ni = number of documents – each entry is a list of grammar symbols MaxEnt is typically used for a multi-class classifier. We are
containing term i given a set of training data, where each datum is labeled with
– P T [i, j] includes x if the words from i up to (but
a set of features and a class (tag). Each feature-class pair
wi = tfi ⇥ idfi not including) j can be analyzed as an x
2
constitutes an indicator function. We train a classifier using
w2A,B tfw,A ⇥tfw,B ⇥idfw
this data, computing the s. We can then classify new data by
sim(A, B) = 2 2
P
– if there is a rule A ! a and word i is an a, put an MaxEnt models can be trained by providing a set of K features
Sentiment analysis (opinion mining) A in P T [i, i + 1] in the form of binary-valued indicator functions fi (h, t).
Opinion mining is the task of judging whether a document We will use a log-linear model, where
expresses a positive or a negative opinion (or no opinion)
– if there is a rule A ! BC, P T [i, j] includes A and fi (h,t)
P T [j, k] includes B, add C to P T [i, k] p(h, t) = (1/Z) K i=1 ↵i
regarding a particular object or topic.
where ↵i is the weight for feature i, and Z is a normalizing
Q
The simplest strategy uses a bag-of-words model. We create Time complexity = O(n3 ) constant.
lists of ’positive’ and ’negative’ words and judge a document Space complexity = O(n2 )
based on whether it has a preponderance of positive or Maximum Entropy Markov Models (MEMM)
negative words (and judge it neutral or ’objective’ if it has few
words of either category.
POS tagging Maximum entropy modeling can be combined with a Markov
model, so that, for each state, the probability of a transition to
Creating such lists is not easy; some of the words are likely to Accuracy that state is computed by a Max Ent model, based on the prior
be quite di↵erent for di↵erent kinds of topics. As an
for part-of-speech tagging, accuracy is a simple and reasonable state and arbitrary features of the input sequence. The result
alternative, we will take a collection of documents on some
metric. is a Maximum Entropy Markov Model (MEMM). Decoding
topic and rate them (by hand) as positive or negative. We will with correct tag
accuracy = tokenstotaltokens (selecting the best tag sequence) can be done deterministically
then use this collection to train a classifier model.
left-to-right or (better) using Viterbi decoding like an HMM.
c = argmaxc2C P (c) i2positions P (wi |c)
Precision and recall
count(docs labeled t) HMM vs MEMM
P (c) =
Q
short reviews, but fails for: precision = correct/response When using a package such as MaxEnt, the computational
• ambiguous terms recall = correct/key linguist’s job becomes one of feature engineering – identifying
Also the features which will be most predictive of the tags we are
• negation precision = True Positives/(True Positives + False Positives) trying to assign. Simple features are the current token (or the
recall = True Positives/(True Positives + False Negatives) previous token, or the next token) having a particular value, or
• comparative reviews
F-measure: the harmonic mean of recall and precision. being on a particular list (such as a list of common titles or
• revealing aspects of an opinion
precision+recall
F = 2 ⇥ precision⇥recall common first names).
144
Lexical semantics and word sense Learning Hobbs search
disambiguation Supervised learning: All training data is labeled In selecting which sentences to search, Hobbs starts with the
Semi-Supervised learning sentence containing the anaphor and then selects prior
Terminology sentences from right to left, examining first the immediately
• Part of training data is labeled
• multiple senses of a word preceding sentence. However, within an individual sentence he
• Make use of redundancies to learn labels of additional searches from left to right; this is the way he captures the
• polysemy (and homonymy for totally unrelated senses data, then train model preference for antecedents in subject position. The resulting
(”bank”))
• Co-training search order defines a Hobbs distance: if in searching for
• metonomy for certain types of regular, productive antecedents of X the d th candidate we consider is Y , we say
polysemy (”the White House”, ”Washington”) • Reduces amount of data which must be hand-labeled to that the Hobbs distance from X to Y is d.
achieve a given level of performance
• zeugma (conjunction combining distinct senses) as test
Active learning: Resolving pronoun reference
for polysemy (”serve”)
• Start with partially labeled data Constraints:
• synonymy: when two words mean (more-or-less) the
same thing
• animacy
• System selects additional ’informative’ examples for
user to label • gender
• hyponymy: X is the hyponym of Y if X denotes a more
specific subclass of Y (X is the hyponym, Y is the Unsupervised learning: driven entirely by an unlabeled • number
hypernym) corpus
Preferences:
Distant supervision: given a large amount of tabular
• synsets: synonym sets information, we use a heuristic procedure to (partially) label a • recent: at most 3 sentences back
text corpus. Distant supervision works roughly like a half
Word Sense Disambiguation • salient: mentioned several times recently
iteration of bootstrapping, but with a large set of seeds. Given
process of identifying the sense of a word in context a data base with a large set of instances of a relation and a • subjects:
Simple supervised WSD algorithm: naive Bayes: large raw text corpus, we use the instances to label the corpus Selectional preferences: prefer antecedent that is more likely to
= argmax(senses)P (s|F ) = and then train a classifier over the corpus. occur in context of pronoun.
P.positive
argmax(s)P (s) i P (f [i]|s) where F = {f 1, f 2, · · · } is the set Confidence of pattern P: Conf (P ) = P.positive+P.negative Ge, Hale, and Charniak formula:
of context features where P.positive = number of positive matches to pattern P
selected sense s’Q
model P (E). P (F |E) is learned from parallel corpus, P (E) EM training translation probabilities (E-step) following (??) and
can be learned from large monolingual corpus. (??)
Typically we are given sentence aligned but not word aligned
The Lexical Translation Model corpora, so we cannot compute these probabilities by direct • compute counts of aligned word pairs, weighted by
IBM Model 1 counting and instead use an iterative procedure called ”EM”. alignment probabilities (E-step)
Translation is based on alignment. Mapping a target word at • recompute MLE word translation probabilities from
position i to a source word at position j. Assumptions: ✏ these counts (M-step)
P (F, A|E) = P (F |E, A) ⇥ P (A|E) = t(fj , eaj )
• Each target word is generated by exactly one source (I + 1)J j
Y
word (1)
• repeat
• A target word can be generated by a NULL word;
multiple target words can be generated by the same P (F, A|E) Copyright c 2018 Antonio Mallia
source word (2) https://www.antoniomallia.it
A P (F, A|E)
P (A|F, E) = P
146
147
Basics
tokens >>> text1[0:100] - first 101 tokens
>>> text2[5] - fifth token
concordance >>> text3.concordance(‘begat’) - basic keyword-in-context
>>> text1.concordance(‘sea’, lines=100) - show other than default 25 lines
>>> text1.concordance(‘sea’, lines=all) - show all results
>>> text1.concordance(‘sea’, 10, lines=all) - change left and right context width to
10 characters and show all results
similar >>> text3.similar(‘silence’) - finds all words that share a common context
common_contexts >>>text1.common_contexts([‘sea’,’ocean’])
Counting
Count a string >>>len(‘this is a string of text’) – number of characters
Count a list of tokens >>>len(text1) –number of tokens
Make and count a list of >>>len(set(text1)) – notice that set return a list of unique tokens
unique tokens
Count occurrences >>> text1.count(‘heaven’) – how many times does a word occur?
Frequency >>>fd = nltk.FreqDist(text1) – creates a new data object that contains information
about word frequency
>>>fd[‘the’] – how many occurences of the word ‘the’
>>>fd.keys() – show the keys in the data object
>>>fd.values() – show the values in the data object
>>>fd.items() – show everything
>>>fd.keys()[0:50] – just show a portion of the info.
Frequency plots >>>fd.plot(50,cumulative=False) – generate a chart of the 50 most frequent words
Other FreqDist functions >>>fd.hapaxes()
>>>fd.freq(‘the’)
Get word lengths >>>lengths = [len(w) for w in text1]
And do FreqDist >>> fd = nltk.FreqDist(lengths)
FreqDist as a table >>>fd.tabulate()
Normalizing
De-punctuate >>>[w for w in text1 if w.isalpha() ] – not so much getting rid of punctuation, but
De-uppercaseify (?) keeping alphabetic characters
>>>[w.lower() for w in text] – make each word in the tokenized list lowercase
>>>[w.lower() for w in text if w.isalpha()] – all in one go
Sort >>>sorted(text1) – careful with this!
Unique words >>>set(text1) – set is oddly named, but very powerful. Leaves you with a list of
only one of each word.
Exclude stopwords Make your own list of word to be excluded:
>>>stopwords = [‘the’,’it’,’she’,’he’]
>>>mynewtext = [w for w in text1 if w not in stopwords]
Or you can also use predefined stopword lists from NLTK:
>>>from nltk.corpus import stopwords
>>>stopwords = stopwords.words(‘english’)
>>> mynewtext = [w for w in text1 if w not in stopwords]
148
Searching
Dispersion plot >>>text4.dispersion_plot([‘American’,’Liberty’,’Government’])
Find word that end with… >>>[w for w in text4 if w.endswith(‘ness’)]
Find words that start with… >>>[w for w in text4 if w.startsswith(‘ness’)]
Find words that contain… >>>[w for w in text4 if ‘ee’ in w]
Combine them together: >>>[w for w in text4 if ‘ee’ in w and w.endswith(‘ing’)]
Regular expressions ‘Regular expressions’ is a syntax for describing sequences of characters usually
used to construct search queries. The Python ‘re’ module must first be imported:
>>>import re
>>>[w for w in text1 if re.search('^ab',w)] – ‘Regular expressions’ is too big of a
topic to cover here. Google it!
Chunking
Collocations are good for getting a quick glimpse of what a text is about
Collocations >>> text4.collocations() - multi-word expressions that commonly co-occur. Notice
that is not necessarily related to the frequency of the words.
>>>text4.collocations(num=100) – alter the number of phrases returned
Bigrams, Trigrams, and n-grams are useful for comparing texts, particularly for
plagiarism detection and collation
Bi-grams >>>nltk.bigrams(text4) – returns every string of two words
Tri-grams >>>nltk.trigrams(text4) – return every string of three words
n-grams >>>nltk.ngrams(text4, 5)
Tagging
part-of-speech tagging >>>mytext = nltk.word_tokenize(“This is my sentence”)
>>> nltk.pos_tag(mytext)
Quitting Python
Quit >>>quit()
Part-of-Speech Codes
Resources
Commands for altering lists – useful in
Python for Humanists 1: Why Learn Python? creating stopword lists
http://www.rogerwhitson.net/?p=1260
list.append(x) - Add an item to the end of the list
list.insert(i, x) - Insert an item, i, at position, x.
‘Natural Language Processing with Python’ book online list.remove(x) - Remove item whose value is x.
http://www.nltk.org/book/ list.pop(x) - Remove item numer x from the list.
150
list(text) Split text into character tokens g=nltk.CFG.fromstring("""..."" Manually define grammar
")
set(text) Unique tokens
parser=nltk.ChartParser(g) Create a parser out of the
len(text) Number of characters
grammar
trees=parser.parse_all(text)
Accessing corpora and lexical resources
for tree in trees: ... print tree
from nltk.corpus import import CorpusReader object
brown from nltk.corpus import treebank
df=pd.DataFrame(time_sents, columns=['text'])
df['text'].str.split().str.len()
df['text'].str.contains('word')
df['text'].str.count(r'\d')
df['text'].str.findall(r'\d')
df['text'].str.replace(r'\w+day\b', '???')
df['text'].str.extract(r'(\d?\d):(\d\d)')
df['text'].str.extractall(r'((\d?\d):(\d\d) ?
([ap]m))')
df['text'].str.extractall(r'(?P<digits>\d)')
Introduction
According to industry estimates, only 21% of the available data is present in structured
form. Data is being generated as we speak, as we tweet, as we send messages on
Whatsapp and in various other activities. Majority of this data exists in the textual form,
which is highly unstructured in nature.
Few notorious examples include – tweets / posts on social media, user to user chat
conversations, news, blogs and articles, product or services reviews and patient records in
the healthcare sector. A few more recent ones includes chatbots and other voice driven bots.
Despite having high dimension data, the information present in it is not directly accessible
unless it is processed (read and understood) manually or analyzed by an automated system.
In order to produce significant and actionable insights from text data, it is important to get
acquainted with the techniques and principles of Natural Language Processing (NLP).
So, if you plan to create chatbots this year, or you want to use the power of unstructured text,
this guide is the right starting point. This guide unearths the concepts of natural language
processing, its techniques and implementation. The aim of the article is to teach the concepts
of natural language processing and apply it on real data set. Moreover, we also have a video
based course on NLP with 3 real life projects.
153
Table of Contents
1. Introduction to NLP
2. Text Preprocessing
o Noise Removal
o Lexicon Normalization
§ Lemmatization
§ Stemming
o Object Standardization
3. Text to Features (Feature Engineering on text data)
o Syntactical Parsing
§ Dependency Grammar
§ Part of Speech Tagging
o Entity Parsing
§ Phrase Detection
§ Named Entity Recognition
§ Topic Modelling
§ N-Grams
o Statistical features
§ TF – IDF
§ Frequency / Density Features
§ Readability Features
o Word Embeddings
4. Important tasks of NLP
o Text Classification
o Text Matching
§ Levenshtein Distance
§ Phonetic Matching
§ Flexible String Matching
o Coreference Resolution
o Other Problems
5. Important NLP libraries
154
Before moving further, I would like to explain some terms that are used in the article:
Download NLTK data: run python shell (in terminal) and write the following code:
```
Follow the instructions on screen and download the desired package or collection. Other
libraries can be directly installed using pip.
155
2. Text Preprocessing
Since, text is the most unstructured form of all the available data, various types of noise are
present in it and the data is not readily analyzable without any pre-processing. The entire
process of cleaning and standardization of text, making it noise-free and ready for analysis is
known as text preprocessing.
• Noise Removal
• Lexicon Normalization
• Object Standardization
For example – language stopwords (commonly used words of a language – is, am, the, of, in
etc), URLs or links, social media entities (mentions, hashtags), punctuations and industry
specific words. This step deals with removal of all types of noisy entities present in the text.
A general approach for noise removal is to prepare a dictionary of noisy entities, and iterate
the text object by tokens (or by words), eliminating those tokens which are present in the
noise dictionary.
```
def _remove_noise(input_text):
words = input_text.split()
return noise_free_text
```
157
Another approach is to use the regular expressions while dealing with special patterns of
noise. We have explained regular expressions in detail in one of our previous
article. Following python code removes a regex pattern from the input text:
```
import re
for i in urls:
return input_text
regex_pattern = "#[\w]*"
```
158
For example – “play”, “player”, “played”, “plays” and “playing” are the different variations of
the word – “play”, Though they mean different but contextually all are similar. The step
converts all the disparities of a word into their normalized form (also known as lemma).
Normalization is a pivotal step for feature engineering with text as it converts the high
dimensional features (N different features) to the low dimensional space (1 feature), which is
an ideal ask for any ML model.
Below is the sample code that performs lemmatization and stemming using python’s
popular library – NLTK.
159
```
lem = WordNetLemmatizer()
stem = PorterStemmer()
word = "multiplying"
lem.lemmatize(word, "v")
>> "multiply"
stem.stem(word)
>> "multipli"
```
160
Some of the examples are – acronyms, hashtags with attached words, and colloquial slangs.
With the help of regular expressions and manually prepared data dictionaries, this type of
noise can be fixed, the code below uses a dictionary lookup method to replace social media
slangs from a text.
161
```
def _lookup_words(input_text):
words = input_text.split()
new_words = []
if word.lower() in lookup_dict:
word = lookup_dict[word.lower()]
return new_text
```
162
Apart from three steps discussed so far, other types of text preprocessing includes encoding-
decoding noise, grammar checker, and spelling correction etc. The detailed article about
preprocessing and its methods is given in one of my previous article.
Dependency Trees – Sentences are composed of some words sewed together. The
relationship among the words in a sentence is determined by the basic dependency
grammar. Dependency grammar is a class of syntactic text analysis that deals with
(labeled) asymmetrical binary relations between two lexical items (words). Every relation
can be represented in the form of a triplet (relation, governor, dependent). For example:
consider the sentence – “Bills on ports and immigration were submitted by Senator
Brownback, Republican of Kansas.” The relationship among the words can be observed in
the form of a tree representation as shown:
163
The tree shows that “submitted” is the root word of this sentence, and is linked by two sub-
trees (subject and object subtrees). Each subtree is a itself a dependency tree with relations
such as – (“Bills” <-> “ports” <by> “proposition” relation), (“ports” <-> “immigration” <by>
“conjugation” relation).
This type of tree, when parsed recursively in top-down manner gives grammar relation triplets
as output which can be used as features for many nlp problems like entity wise sentiment
analysis, actor & entity identification, and text classification. The python
wrapper StanfordCoreNLP (by Stanford NLP Group, only commercial license) and NLTK
dependency grammars can be used to generate dependency trees.
Part of speech tagging – Apart from the grammar relations, every word in a sentence is also
associated with a part of speech (pos) tag (nouns, verbs, adjectives, adverbs etc). The pos
tags defines the usage and function of a word in the sentence. H ere is a list of all possible
pos-tags defined by Pennsylvania university. Following code using NLTK performs pos
tagging annotation on input text. (it provides several implementations, the default one is
perceptron tagger)
164
```
tokens = word_tokenize(text)
print pos_tag(tokens)
anguage', 'NNP'),
```
A.Word sense disambiguation: Some language words have multiple meanings according
to their usage. For example, in the two sentences below:
“Book” is used with different context, however the part of speech tag for both of the cases
are different. In sentence I, the word “book” is used as v erb, while in II it is used as no un.
(Lesk Algorithm is also us ed for similar purposes)
165
Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)
Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1),
(“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
C. Normalization and Lemmatization: POS tags are the basis of lemmatization process for
converting a word to its base form (lemma).
D.Efficient stopword removal : P OS tags are also useful in efficient removal of stopwords.
For example, there are some tags which always define the low frequency / less important
words of a language. For example: (IN – “within”, “upon”, “except”), (CD – “one”,”two”,
“hundred”), (MD – “may”, “mu st” etc)
Topic Modelling & Named Entity Recognition are the two key entity detection methods in
NLP.
166
The process of detecting the named entities such as person names, location names,
company names etc from the text is called as NER. For example :
Sentence – Sergey Brin, the manager of Google Inc. is walking in the streets of New York.
Named Entities – ( “person” : “Sergey Brin” ), (“org” : “Google Inc.”), (“location” : “New
York”)
Noun phrase identification: This step deals with extracting all the noun phrases from a
text using dependency parsing and part of speech tagging.
Phrase classification: This is the classification step in which all the extracted noun
phrases are classified into respective categories (locations, names etc). Google Maps API
provides a good path to disambiguate locations, Then, the open databases from dbpedia,
wikipedia can be used to identify person names or company names. Apart from this, one
can curate the lookup tables and dictionaries by combining information from different
sources.
B. Topic Modeling
Topic modeling is a process of automatically identifying the topics present in a text corpus, it
derives the hidden patterns among the words in the corpus in an unsupervised manner.
Topics are defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic
model results in – “health”, “doctor”, “patient”, “hospital” for a topic – Healthcare, and “farm”,
“crops”, “wheat” for a topic – “Farming”.
Latent Dirichlet Allocation (LDA) is the most popular topic modelling technique, Following is
the code to implement topic modeling using LDA in python. For a detailed explanation about
its working and implementation, check the complete article here.
167
```
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my fa
ther."
doc2 = "My father spends a lot of time driving my sister around to dance prac
tice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pre
ssure."
import corpora
# Creating the term dictionary of our corpus, where every unique term is assi
gned an index.
dictionary = corpora.Dictionary(doc_clean)
168
# Converting list of documents (corpus) into Document Term Matrix using dicti
Lda = gensim.models.ldamodel.LdaModel
# Results
print(ldamodel.print_topics())
```
169
C. N-Grams as Features
A combination of N words together are called N-Grams. N grams (N > 1) are generally more
informative as compared to words (Unigrams) as features. Also, bigrams (N = 2) are
considered as the most important features of all the others. The following code generates
bigram of a text.
```
words = text.split()
output = []
for i in range(len(words)-n+1):
output.append(words[i:i+n])
return output
```
170
TF-IDF is a weighted model commonly used for information retrieval problems. It aims to
convert the text documents into vector models on the basis of occurrence of words in the
documents without taking considering the exact ordering. For Example – let say there is a
dataset of N text documents, In any document “D”, TF and IDF will be defined as –
Term Frequency (TF) – TF for a term “t” is defined as the count of a term “t” in a document
“D”
Inverse Document Frequency (IDF) – IDF for a term is defined as logarithm of ratio of
total documents available in the corpus and number of documents containing the term T.
TF . IDF – TF IDF formula gives the relative importance of a term in a corpus (list of
documents), given by the following formula below. Following is the code using python’s scikit
learn package to convert a text into tf idf vectors:
171
```
obj = TfidfVectorizer()
le document text']
X = obj.fit_transform(corpus)
print X
>>>
(0, 1) 0.345205016865
(2, 1) 0.345205016865
(2, 4) 0.444514311537
```
The model creates a vocabulary dictionary and assigns an index to each word. Each row in
the output contains a tuple (i,j) and a tf-idf value of word at index j in document i.
172
Count or Density based features can also be used in models and analysis. These features
might seem trivial but shows a great impact in learning models. Some of the features are:
Word Count, Sentence Count, Punctuation Counts and Industry specific word counts. Other
types of measures include readability measures such as syllable counts, smog index and
flesch reading ease. Refer to Textstat library to create such features.
Word2Vec and GloVe are the two popular models to create word embedding of a text. These
models takes a text corpus as input and produces the word vectors as output.
```
>>> 0.11222489293
print model['learning']
```
They can be used as feature vectors for ML model, used to measure text similarity using
cosine similarity techniques, words clustering and text classification techniques.
174
A typical natural language classifier consists of two parts: (a) Training (b) Prediction as shown
in image below. Firstly the text input is processes and features are created. The machine
learning models then learn these features and is used for predicting against the new text.
175
Here is a code that uses naive bayes classifier using text blob library (built on top of nltk).
```
training_corpus = [
test_corpus = [
'Class_B')]
model = NBC(training_corpus)
>>> "Class_A"
>>> "Class_B"
print(model.accuracy(test_corpus))
>>> 0.83
```
177
```
from sklearn.feature_extraction.text
import classification_report
# preparing data for SVM model (using the same training_corpus, test_corpus f
train_data = []
train_labels = []
train_data.append(row[0])
train_labels.append(row[1])
test_data = []
178
test_labels = []
test_data.append(row[0])
test_labels.append(row[1])
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)
model = svm.SVC(kernel='linear')
model.fit(train_vectors, train_labels)
179
prediction = model.predict(test_vectors)
```
The text classification model are heavily dependent upon the quality and quantity of features,
while applying any machine learning model it is always a good practice to include more and
more training data. H ere are some tips that I wrote about improving the text classification
accuracy in one of my previous article.
180
A number of text matching techniques are available depending upon the requirement. This
section describes the important techniques in detail.
def levenshtein(s1,s2):
s1,s2 = s2,s1
distances = range(len(s1) + 1)
newDistances = [index2+1]
if char1 == char2:
newDistances.append(distances[index1])
else:
dex1+1], newDistances[-1])))
distances = newDistances
return distances[-1]
print(levenshtein("analyze","analyse"))
182
```
import fuzzy
soundex = fuzzy.Soundex(4)
print soundex('ankit')
>>> “A523”
print soundex('aunkit')
>>> “A523”
```
D. Cosine Similarity – W hen the text is represented as vector notation, a general cosine
similarity can also be applied in order to measure vectorized similarity. Following code
converts a text to vectors (using term frequency) and applies cosine similarity to provide
closeness among two text.
```
import math
if not denominator:
return 0.0
184
else:
def text_to_vector(text):
words = text.split()
return Counter(words)
vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
>>> 0.62
```
185
Humans can quickly figure out that “he” denotes Donald (and not John), and that “it” denotes
the table (and not John’s office). Coreference Resolution is the component of NLP that does
this job automatically. It is used in document summarization, question answering, and
information extraction. Stanford CoreNLP provides a python wrapper for commercial
purposes.