Python For Data Science Cheat Sheet: Bokeh
Learn Bokeh interactively at www.DataCamp.com, taught by Bryan Van de Ven, core contributor

Plotting With Bokeh

The Python interactive visualization library Bokeh enables high-performance visual presentation of large datasets in modern web browsers.

Bokeh's mid-level, general-purpose bokeh.plotting interface is centered around two main components: data and glyphs. Data plus glyphs make up a plot.

The basic steps to creating plots with the bokeh.plotting interface are:
1. Prepare some data: Python lists, NumPy arrays, Pandas DataFrames and other sequences of values
2. Create a new plot
3. Add renderers for your data, with visual customizations
4. Specify where to generate the output
5. Show or save the results

>>> from bokeh.plotting import figure
>>> from bokeh.io import output_file, show
>>> x = [1, 2, 3, 4, 5]                              # Step 1
>>> y = [6, 7, 2, 4, 5]
>>> p = figure(title="simple line example",          # Step 2
               x_axis_label='x',
               y_axis_label='y')
>>> p.line(x, y, legend="Temp.", line_width=2)       # Step 3
>>> output_file("lines.html")                        # Step 4
>>> show(p)                                          # Step 5

1 Data    (Also see Lists, NumPy & Pandas)

Under the hood, your data is converted to ColumnDataSource objects. You can also do this manually:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.array([[33.9, 4, 65, 'US'],
                                [32.4, 4, 66, 'Asia'],
                                [21.4, 4, 109, 'Europe']]),
                      columns=['mpg', 'cyl', 'hp', 'origin'],
                      index=['Toyota', 'Fiat', 'Volvo'])
>>> from bokeh.models import ColumnDataSource
>>> cds_df = ColumnDataSource(df)

2 Plotting

>>> from bokeh.plotting import figure
>>> p1 = figure(plot_width=300, tools='pan,box_zoom')
>>> p2 = figure(plot_width=300, plot_height=300,
                x_range=(0, 8), y_range=(0, 8))
>>> p3 = figure()

3 Renderers & Visual Customizations

Glyphs

Scatter Markers
>>> p1.circle(np.array([1,2,3]), np.array([3,2,1]),
              fill_color='white')
>>> p2.square(np.array([1.5,3.5,5.5]), [1,4,3],
              color='blue', size=1)

Line Glyphs
>>> p1.line([1,2,3,4], [3,4,5,6], line_width=2)
>>> p2.multi_line(pd.DataFrame([[1,2,3],[5,6,7]]),
                  pd.DataFrame([[3,4,5],[3,2,1]]),
                  color="blue")

Customized Glyphs

Selection and Non-Selection Glyphs
>>> p = figure(tools='box_select')
>>> p.circle('mpg', 'cyl', source=cds_df,
             selection_color='red',
             nonselection_alpha=0.1)

Hover Glyphs
>>> from bokeh.models import HoverTool
>>> hover = HoverTool(tooltips=None, mode='vline')
>>> p3.add_tools(hover)

Colormapping
>>> from bokeh.models import CategoricalColorMapper
>>> color_mapper = CategoricalColorMapper(
        factors=['US', 'Asia', 'Europe'],
        palette=['blue', 'red', 'green'])
>>> p3.circle('mpg', 'cyl', source=cds_df,
              color=dict(field='origin',
                         transform=color_mapper),
              legend='Origin')

Legend Location

Inside Plot Area
>>> p.legend.location = 'bottom_left'

Outside Plot Area
>>> from bokeh.models import Legend
>>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1]))
>>> r2 = p2.line([1,2,3,4], [3,4,5,6])
>>> legend = Legend(items=[("One", [p1, r1]), ("Two", [r2])],
                    location=(0, -30))
>>> p.add_layout(legend, 'right')

Legend Orientation
>>> p.legend.orientation = "horizontal"
>>> p.legend.orientation = "vertical"

Legend Background & Border
>>> p.legend.border_line_color = "navy"
>>> p.legend.background_fill_color = "white"

Rows & Columns Layout

Rows
>>> from bokeh.layouts import row
>>> layout = row(p1,p2,p3)

Columns
>>> from bokeh.layouts import column
>>> layout = column(p1,p2,p3)

Nesting Rows & Columns
>>> layout = row(column(p1,p2), p3)

Grid Layout
>>> from bokeh.layouts import gridplot
>>> row1 = [p1,p2]
>>> row2 = [p3]
>>> layout = gridplot([[p1,p2],[p3]])

Tabbed Layout
>>> from bokeh.models.widgets import Panel, Tabs
>>> tab1 = Panel(child=p1, title="tab1")
>>> tab2 = Panel(child=p2, title="tab2")
>>> layout = Tabs(tabs=[tab1, tab2])

Linked Plots    (Also see Data)

Linked Axes
>>> p2.x_range = p1.x_range
>>> p2.y_range = p1.y_range

Linked Brushing
>>> p4 = figure(plot_width=100,
                tools='box_select,lasso_select')
>>> p4.circle('mpg', 'cyl', source=cds_df)
>>> p5 = figure(plot_width=200,
                tools='box_select,lasso_select')
>>> p5.circle('mpg', 'hp', source=cds_df)
>>> layout = row(p4,p5)

4 Output & Export

Notebook
>>> from bokeh.io import output_notebook, show
>>> output_notebook()

HTML

Standalone HTML
>>> from bokeh.embed import file_html
>>> from bokeh.resources import CDN
>>> html = file_html(p, CDN, "my_plot")
>>> from bokeh.io import output_file, show
>>> output_file('my_bar_chart.html', mode='cdn')

Components
>>> from bokeh.embed import components
>>> script, div = components(p)

PNG
>>> from bokeh.io import export_png
>>> export_png(p, filename="plot.png")

SVG
>>> from bokeh.io import export_svgs
>>> p.output_backend = "svg"
>>> export_svgs(p, filename="plot.svg")

5 Show or Save Your Plots
>>> show(p1)        >>> show(layout)
>>> save(p1)        >>> save(layout)
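The snippets above assume the figures p1-p5 and the ColumnDataSource cds_df already exist. Below is a minimal end-to-end sketch that ties the steps together: data, a shared ColumnDataSource, customized glyphs with a hover tool, a grid layout, and HTML output. It assumes a Bokeh version contemporary with this sheet (the 0.12/1.x API, e.g. plot_width); the data, column names, and file name are made up for the example.

import pandas as pd
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.layouts import gridplot
from bokeh.io import output_file, show

# Step 1: prepare some data and wrap it in a ColumnDataSource
df = pd.DataFrame({'mpg': [33.9, 32.4, 21.4],
                   'cyl': [4, 4, 4],
                   'hp': [65, 66, 109]},
                  index=['Toyota', 'Fiat', 'Volvo'])
src = ColumnDataSource(df)

# Steps 2-3: create plots and add customized glyph renderers
left = figure(plot_width=300, plot_height=300, tools='box_select,lasso_select')
left.circle('mpg', 'cyl', source=src, selection_color='red', nonselection_alpha=0.1)

right = figure(plot_width=300, plot_height=300, tools='box_select,lasso_select')
right.circle('mpg', 'hp', source=src)
right.add_tools(HoverTool(tooltips=[('mpg', '@mpg'), ('hp', '@hp')]))

# Steps 4-5: sharing one source links the selections (brushing); write and show HTML
layout = gridplot([[left, right]])
output_file('cars.html')
show(layout)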
DataCamp: Learn Python for Data Science Interactively
R For Data Science Cheat Sheet: data.table
Learn R for data science interactively at www.DataCamp.com

data.table

data.table is an R package that provides a high-performance version of base R's data.frame with syntax and feature enhancements for ease of use, convenience and programming speed.

General form: DT[i, j, by] means "Take DT, subset rows using i, then calculate j grouped by by"

Load the package:
> library(data.table)

Creating A data.table
> set.seed(45L)                              Create a data.table and call it DT
> DT <- data.table(V1=c(1L,2L),
                   V2=LETTERS[1:3],
                   V3=round(rnorm(4),4),
                   V4=1:12)

Subsetting Rows Using i
> DT[3:5,]                     Select 3rd to 5th row
> DT[3:5]                      Select 3rd to 5th row
> DT[V2=="A"]                  Select all rows that have value A in column V2
> DT[V2 %in% c("A","C")]       Select all rows that have value A or C in column V2

Manipulating on Columns in j
> DT[,V2]                      Return V2 as a vector: [1] "A" "B" "C" "A" "B" "C" ...
> DT[,.(V2,V3)]                Return V2 and V3 as a data.table
> DT[,sum(V1)]                 Return the sum of all elements of V1 in a vector: [1] 18
> DT[,.(sum(V1),sd(V3))]       Return the sum of all elements of V1 and the std. dev. of V3 in a data.table
> DT[,.(Aggregate=sum(V1),     The same as the above, with new names
        Sd.V3=sd(V3))]
> DT[,.(V1,Sd.V3=sd(V3))]      Select column V1 and compute std. dev. of V3, which returns a single value and gets recycled
> DT[,.(print(V2),             Print column V2 and plot V3
        plot(V3),
        NULL)]

Doing j by Group
> DT[,.(V4.Sum=sum(V4)),by=V1]            Calculate sum of V4 for every group in V1
> DT[,.(V4.Sum=sum(V4)),by=.(V1,V2)]      Calculate sum of V4 for every group in V1 and V2
> DT[,.(V4.Sum=sum(V4)),by=sign(V1-1)]    Calculate sum of V4 for every group in sign(V1-1)
> DT[,.(V4.Sum=sum(V4)),                  The same as the above, with a new name
     by=.(V1.01=sign(V1-1))]              for the variable you're grouping by
> DT[1:5,.(V4.Sum=sum(V4)),by=V1]         Calculate sum of V4 for every group in V1 after subsetting on the first 5 rows
> DT[,.N,by=V1]                           Count number of rows for every group in V1

Adding/Updating Columns By Reference in j Using :=
> DT[,V1:=round(exp(V1),2)]               V1 is updated by what is after :=
> DT                                      Return the result by calling DT
> DT[,c("V1","V2"):=list(                 Columns V1 and V2 are updated by what is after :=
     round(exp(V1),2), LETTERS[4:6])]
> DT[,':='(V1=round(exp(V1),2),           Alternative to the above. With [], you print the result to the screen
     V2=LETTERS[4:6])][]
> DT[,V1:=NULL]                           Remove V1
> DT[,c("V1","V2"):=NULL]                 Remove columns V1 and V2
> Cols.chosen <- c("A","B")
> DT[,Cols.chosen:=NULL]                  Delete the column with column name Cols.chosen
> DT[,(Cols.chosen):=NULL]                Delete the columns specified in the variable Cols.chosen

Indexing And Keys
> setkey(DT,V2)                A key is set on V2; output is returned invisibly
> DT["A"]                      Return all rows where the key column (set to V2) has the value A
> DT[c("A","C")]               Return all rows where the key column (V2) has value A or C
> DT["A",mult="first"]         Return first row of all rows that match value A in key column V2
> DT["A",mult="last"]          Return last row of all rows that match value A in key column V2
> DT[c("A","D")]               Return all rows where key column V2 has value A or D
> DT[c("A","D"),nomatch=0]     Return all rows where key column V2 has value A or D, dropping non-matching keys
> DT[c("A","C"),sum(V4)]       Return the total sum of V4 for rows of key column V2 that have value A or C
> DT[c("A","C"),sum(V4),       Return the sum of column V4 for rows of V2 that have value A,
     by=.EACHI]                and another sum for rows of V2 that have value C
> setkey(DT,V1,V2)             Sort by V1 and then by V2 within each group of V1 (invisible)
> DT[.(2,"C")]                 Select rows that have value 2 for the first key (V1) and the value C for the second key (V2)
> DT[.(2,c("A","C"))]          Select rows that have value 2 for the first key (V1) and within those rows the value A or C for the second key (V2)

Advanced Data Table Operations
> DT[.N-1]                     Return the penultimate row of the DT
> DT[,.N]                      Return the number of rows
> DT[,.(V2,V3)]                Return V2 and V3 as a data.table
> DT[,list(V2,V3)]             Return V2 and V3 as a data.table
> DT[,mean(V3),by=.(V1,V2)]    Return the result of j, grouped by all possible combinations of groups specified in by

.SD & .SDcols
> DT[,print(.SD),by=V2]                   Look at what .SD contains
> DT[,.SD[c(1,.N)],by=V2]                 Select the first and last row grouped by V2
> DT[,lapply(.SD,sum),by=V2]              Calculate sum of columns in .SD grouped by V2
> DT[,lapply(.SD,sum),by=V2,              Calculate sum of V3 and V4 in .SD grouped by V2
     .SDcols=c("V3","V4")]
> DT[,lapply(.SD,sum),by=V2,              Calculate sum of V3 and V4 in .SD grouped by V2
     .SDcols=paste0("V",3:4)]

Chaining
> DT <- DT[,.(V4.Sum=sum(V4)),by=V1]      Calculate sum of V4, grouped by V1
> DT[V4.Sum>40]                           Select the groups of which the sum is >40
> DT[,.(V4.Sum=sum(V4)),                  Select the groups of which the sum is >40 (chaining)
     by=V1][V4.Sum>40]
> DT[,.(V4.Sum=sum(V4)),                  Calculate sum of V4, grouped by V1, ordered on V1
     by=V1][order(-V1)]

set()-Family

set()
Syntax: for (i in from:to) set(DT, row, column, new value)
> rows <- list(3:4,5:6)
> cols <- 1:2
> for(i in seq_along(rows))               Sequence along the values of rows, and
  {set(DT,                                for the values of cols, set the values of
       i=rows[[i]],                       those elements equal to NA (invisible)
       j=cols[i],
       value=NA)}

setnames()
Syntax: setnames(DT,"old","new")[]
> setnames(DT,"V2","Rating")              Set name of V2 to Rating (invisible)
> setnames(DT,                            Change two column names (invisible)
           c("V2","V3"),
           c("V2.rating","V3.DC"))

setcolorder()
Syntax: setcolorder(DT,"neworder")
> setcolorder(DT,                         Change column ordering to the contents
              c("V2","V1","V4","V3"))     of the specified vector (invisible)

DataCamp: Learn R for Data Science Interactively
Data Science Cheat Sheet for Business Leaders

Data Science Basics

Types of Data Science

Descriptive Analytics (Business Intelligence): Get useful data in front of the right people in the form of dashboards, reports, and emails
- Which customers have churned?
- Which homes have sold in a given location, and do homes of a certain size sell more quickly?

Predictive Analytics (Machine Learning): Put data science models continuously into production
- Which customers may churn?
- How much will a home sell for, given its location and number of rooms?

Prescriptive Analytics (Decision Science): Use data to help a company make decisions
- What should we do about the particular types of customers that are prone to churn?
- How should we market a home to sell quickly, given its location and number of rooms?

Building a Data Science Team

Your data team members require different skills for different purposes.

Role                        What they do                                        Typical tools
Data Engineer               Store and maintain data                             SQL/Java/Scala/Python
Data Analyst                Visualize and describe data                         SQL + BI Tools + Spreadsheets
Machine Learning Engineer   Write production-level code to predict with data    Python/Java/R
Data Scientist              Build custom models to drive business decisions     Python/R/SQL

Data Science Team Organizational Models

Centralized/isolated: The data team is the owner of data and answers requests from other teams
Embedded: Data experts are dispersed across an organization and report to functional leaders
Hybrid: Data experts sit with functional teams and also report to the Chief Data Scientist, so data is an organizational priority

(Diagram: organization charts for each model, showing squads and functional teams such as Data Engineering and Design & Product.)

The Standard Data Science Workflow

1. Data Collection: Compile data from different sources and store it for efficient access
2. Exploration and Visualization: Explore and visualize data through dashboards
3. Experimentation and Prediction: The buzziest topic in data science: machine learning!

datacamp.com/courses/data-science-for-business-leaders | datacamp.com/business
Exploration and Visualization

The type of dashboard you should use depends on what you'll be using it for.

Common Dashboard Elements

Type                 What is it best for?
Time series          Tracking a value over time
Stacked bar chart    Tracking composition over time
Bar chart            Categorical comparison

(The original table also shows an example chart for each element.)

Popular Dashboard Tools

Spreadsheets    BI Tools    Customized Tools
Excel           Power BI    R Shiny
Sheets          Tableau     d3.js
                Looker

When You Should Request a Dashboard
- When you'll use it multiple times
- When you'll need the information updated regularly
- When the request will always be the same

Experimentation and Prediction

Machine Learning

Machine learning is an application of artificial intelligence (AI) that builds algorithms and statistical models trained on data to address specific questions without explicit instructions.

            Supervised Machine Learning                          Unsupervised Machine Learning
Purpose     Makes predictions from data with labels and          Makes predictions by clustering data with no
            features                                             labels into categories
Example     Recommendation systems, email subject                Image segmentation, customer segmentation
            optimization, churn prediction

Special Topics in Machine Learning

Time Series Forecasting is a technique for predicting events through a sequence of time and can capture seasonality or periodic events.

Natural Language Processing (NLP) allows computers to process and analyze large amounts of natural language data.
- Text as input data
- Word counts track the important words in a text
- Word embeddings create features that group similar words

Deep Learning / Neural Networks enables unsupervised machine learning using data that is unstructured or unlabeled.
- Highly accurate predictions
- Better for "What?"

Explainable AI is an emerging field in machine learning that applies AI such that results can be easily understood.
- Understandable by humans
- Better for "Why?"

datacamp.com/courses/data-science-for-business-leaders | datacamp.com/business
Data Science Cheatsheet
Compiled by Maverick Lin (http://mavericklin.com)
Last Updated August 13, 2018

What is Data Science?

Multi-disciplinary field that brings together concepts from computer science, statistics/machine learning, and data analysis to understand and extract insights from the ever-increasing amounts of data.

Two paradigms of data research:
1. Hypothesis-Driven: Given a problem, what kind of data do we need to help solve it?
2. Data-Driven: Given some data, what interesting problems can be solved with it?

The heart of data science is to always ask questions. Always be curious about the world.
1. What can we learn from this data?
2. What actions can we take once we find whatever it is we are looking for?

Types of Data

Structured: Data that has predefined structures, e.g. tables, spreadsheets, or relational databases.
Unstructured: Data with no predefined structure; comes in any size or form and cannot be easily stored in tables, e.g. blobs of text, images, audio.
Quantitative: Numerical data, e.g. height, weight.
Categorical: Data that can be labeled or divided into groups, e.g. race, sex, hair color.
Big Data: Massive datasets, or data that contains greater variety, arriving in increasing volumes and with ever-higher velocity (the 3 Vs). Cannot fit in the memory of a single machine.

Data Sources/Formats

Most Common Data Formats: CSV, XML, SQL, JSON, Protocol Buffers
Data Sources: Companies/Proprietary Data, APIs, Government, Academic, Web Scraping/Crawling

Main Types of Problems

Two problems arise repeatedly in data science.
Classification: Assigning something to a discrete set of possibilities, e.g. spam or non-spam, Democrat or Republican, blood type (A, B, AB, O).
Regression: Predicting a numerical value, e.g. someone's income, next year's GDP, a stock price.

Probability Overview

Probability theory provides a framework for reasoning about the likelihood of events.

Terminology
Experiment: procedure that yields one of a possible set of outcomes, e.g. repeatedly tossing a die or coin.
Sample Space S: set of possible outcomes of an experiment, e.g. if tossing a die, S = {1,2,3,4,5,6}.
Event E: set of outcomes of an experiment, e.g. the event that a roll is 5, or the event that the sum of 2 rolls is 7.
Probability of an Outcome s, or P(s): number that satisfies two properties:
1. for each outcome s, 0 ≤ P(s) ≤ 1
2. Σ_s p(s) = 1
Probability of Event E: sum of the probabilities of the outcomes of the experiment: p(E) = Σ_{s⊂E} p(s)
Random Variable V: numerical function on the outcomes of a probability space.
Expected Value of Random Variable V: E(V) = Σ_{s⊂S} p(s) · V(s)

Independence, Conditional, Compound
Independent Events: A and B are independent iff:
P(A ∩ B) = P(A)P(B)
P(A|B) = P(A)
P(B|A) = P(B)
Conditional Probability: P(A|B) = P(A,B)/P(B)
Bayes Theorem: P(A|B) = P(B|A)P(A)/P(B)
Joint Probability: P(A,B) = P(B|A)P(A)
Marginal Probability: P(A)

Probability Distributions
Probability Density Function (PDF): gives the probability that a rv takes on the value x: p_X(x) = P(X = x)
Cumulative Density Function (CDF): gives the probability that a random variable is less than or equal to x: F_X(x) = P(X ≤ x)
Note: The PDF and the CDF of a given random variable contain exactly the same information.

Descriptive Statistics

Provides a way of capturing a given data set or sample. There are two main types: centrality and variability measures.

Centrality
Arithmetic Mean: useful to characterize symmetric distributions without outliers: µ_X = (1/n) Σ x
Geometric Mean: useful for averaging ratios; never greater than the arithmetic mean: (a_1 · a_2 · ... · a_n)^(1/n)
Median: exact middle value among a dataset. Useful for skewed distributions or data with outliers.
Mode: most frequent element in a dataset.

Variability
Standard Deviation: measures the squared differences between the individual elements and the mean: σ = sqrt( Σ_{i=1..N} (x_i − x̄)² / (N − 1) )
Variance: V = σ²

Interpreting Variance
Variance is an inherent part of the universe. It is impossible to obtain the same results after repeated observations of the same event due to random noise/error. Variance can be explained away by attributing it to sampling or measurement errors; other times, the variance is due to the random fluctuations of the universe.

Correlation Analysis
The correlation coefficient r(X,Y) is a statistic that measures the degree to which Y is a function of X and vice versa. Correlation values range from -1 to 1, where 1 means fully correlated, -1 means negatively correlated, and 0 means no correlation.
Pearson Coefficient: measures the degree of the relationship between linearly related variables: r = Cov(X,Y) / (σ(X)σ(Y))
Spearman Rank Coefficient: computed on ranks; depicts monotonic relationships.
Note: Correlation does not imply causation!
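The centrality, variability, and correlation measures above map directly onto NumPy/SciPy calls. A small sketch (the data here is made up; the function names are standard NumPy/SciPy):

import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 8.0, 9.0])

mean   = x.mean()                   # arithmetic mean
gmean  = stats.gmean(x)             # geometric mean (never greater than the arithmetic mean)
median = np.median(x)               # robust to skew and outliers
std    = x.std(ddof=1)              # sample standard deviation (divides by N - 1)
var    = x.var(ddof=1)              # variance = std**2

pearson  = stats.pearsonr(x, y)[0]  # linear relationship
spearman = stats.spearmanr(x, y)[0] # monotonic relationship, computed on ranks
print(mean, gmean, median, std, var, pearson, spearman)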
Data Cleaning

Data cleaning is the process of turning raw data into a clean and analyzable data set. "Garbage in, garbage out." Make sure garbage doesn't get put in.

Errors vs. Artifacts
1. Errors: information that is lost during acquisition and can never be recovered, e.g. power outage, crashed servers
2. Artifacts: systematic problems that arise from the data cleaning process. These problems can be corrected, but we must first discover them.

Data Compatibility
Data compatibility problems arise when merging datasets. Make sure you are comparing "apples to apples" and not "apples to oranges". Main types of conversions/unifications:
• units (metric vs. imperial)
• numbers (decimals vs. integers)
• names (John Smith vs. Smith, John)
• time/dates (UNIX vs. UTC vs. GMT)
• currency (currency type, inflation-adjusted, dividends)

Data Imputation
Process of dealing with missing values. The proper methods depend on the type of data we are working with. General methods include:
• Drop all records containing missing data
• Heuristic-Based: make a reasonable guess based on knowledge of the underlying domain
• Mean Value: fill in missing data with the mean
• Random Value
• Nearest Neighbor: fill in missing data using similar data points
• Interpolation: use a method like linear regression to predict the value of the missing data

Outlier Detection
Outliers can interfere with analysis and often arise from mistakes during data collection. It makes sense to run a "sanity check".

Miscellaneous
Lowercasing, removing non-alphanumeric characters, repairing, unidecode, removing unknown characters.

Note: When cleaning data, always maintain both the raw data and the cleaned version(s). The raw data should be kept intact and preserved for future use. Any type of data cleaning/analysis should be done on a copy of the raw data.

Feature Engineering

Feature engineering is the process of using domain knowledge to create features or input variables that help machine learning algorithms perform better. Done correctly, it can help increase the predictive power of your models. Feature engineering is more of an art than a science. FE is one of the most important steps in creating a good model. As Andrew Ng puts it:

"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."

Continuous Data
Raw Measures: data that hasn't been transformed yet
Rounding: sometimes precision is noise; round to nearest integer, decimal, etc.
Scaling: log, z-score, minmax scale
Imputation: fill in missing values using mean, median, model output, etc.
Binning: transforming numeric features into categorical (binned) ones, e.g. values between 1-10 belong to A, between 10-20 belong to B, etc.
Interactions: interactions between features, e.g. subtraction, addition, multiplication, statistical test
Statistical: log/power transform (helps turn skewed distributions more normal), Box-Cox
Row Statistics: number of NaNs, 0's, negative values, max, min, etc.
Dimensionality Reduction: using PCA, clustering, factor analysis, etc.

Discrete Data
Encoding: since some ML algorithms cannot work on categorical data, we need to turn categorical data into numerical data or vectors
Ordinal Values: convert each distinct feature into a random number (e.g. [r,g,b] becomes [1,2,3])
One-Hot Encoding: each of the m features becomes a vector of length m containing only one 1 (e.g. [r, g, b] becomes [[1,0,0],[0,1,0],[0,0,1]])
Feature Hashing Scheme: turns arbitrary features into indices in a vector or matrix
Embeddings: if using words, convert words to vectors (word embeddings)
Statistical Analysis

Process of statistical reasoning: there is an underlying population of possible things we can potentially observe, and only a small subset of them are actually sampled (ideally at random). Probability theory describes what properties our sample should have given the properties of the population, but statistical inference allows us to deduce what the full population is like after analyzing the sample.

Sampling From Distributions

Inverse Transform Sampling: sampling points from a given probability distribution is sometimes necessary to run simulations or to test whether your data fits a particular distribution. The general technique is called inverse transform sampling or the Smirnov transform. First draw a random number p between [0,1]. Compute the value x such that the CDF equals p: F_X(x) = p. Use x as the random value drawn from the distribution described by F_X(x).

Monte Carlo Sampling: in higher dimensions, correctly sampling from a given distribution becomes more tricky. Generally you want to use Monte Carlo methods, which typically follow these rules: define a domain of possible inputs, generate random inputs from a probability distribution over the domain, perform a deterministic calculation, and analyze the results.
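A minimal sketch of inverse transform sampling for an exponential distribution, where the inverse CDF is known in closed form (F(x) = 1 - e^(-λx), so F⁻¹(p) = -ln(1 - p)/λ); the rate λ is just an example value:

import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                  # rate of the target Exponential(lam) distribution

p = rng.uniform(0.0, 1.0, size=100_000)    # step 1: uniform draws on [0, 1]
x = -np.log(1.0 - p) / lam                 # step 2: x = F^-1(p), the inverse CDF

# the sample mean should be close to the true mean 1/lam = 0.5
print(x.mean())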
Classic Statistical Distributions

Binomial Distribution (Discrete)
Assume X is distributed Bin(n,p). X is the number of "successes" that we will achieve in n independent trials, where each trial is either a success or a failure, each success occurs with the same probability p, and each failure occurs with probability q = 1 - p.
PDF: P(X = x) = C(n,x) p^x (1 − p)^(n−x)
EV: µ = np    Variance = npq

Normal/Gaussian Distribution (Continuous)
Assume X is distributed N(µ, σ²). It is a bell-shaped and symmetric distribution. The bulk of the values lie close to the mean and no value is too extreme. Generalization of the binomial distribution as n → ∞.
PDF: P(x) = (1 / (σ√(2π))) e^(−(x−µ)² / (2σ²))
EV: µ    Variance: σ²
Implications: 68%-95%-99.7% rule. 68% of the probability mass falls within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.

Poisson Distribution (Discrete)
Assume X is distributed Pois(λ). Poisson expresses the probability of a given number of events occurring in a fixed interval of time/space if these events occur independently and with a known constant rate λ.
PDF: P(x) = e^(−λ) λ^x / x!    EV: λ    Variance = λ

Power Law Distributions (Discrete)
Many data distributions have much longer tails than the normal or Poisson distributions. In other words, the change in one quantity varies as a power of another quantity. It helps measure the inequality in the world, e.g. wealth, word frequency and the Pareto Principle (80/20 Rule).
PDF: P(X = x) = c x^(−α), where α is the law's exponent and c is the normalizing constant
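These distributions are available in scipy.stats; a small sketch checking the PMF/PDF and the mean/variance formulas above (the parameter values are arbitrary examples):

from scipy import stats

n, p = 10, 0.3
print(stats.binom.pmf(3, n, p))                 # P(X = 3) for Bin(n, p)
print(stats.binom.stats(n, p, moments='mv'))    # (np, npq)

mu, sigma = 0.0, 1.0
print(stats.norm.pdf(0.5, loc=mu, scale=sigma)) # Gaussian density at x = 0.5
print(stats.norm.cdf(1) - stats.norm.cdf(-1))   # about 0.68, the 1-sigma rule

lam = 4.0
print(stats.poisson.pmf(2, lam))                # P(X = 2) for Pois(lam)
print(stats.poisson.stats(lam, moments='mv'))   # (lam, lam)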
Modeling: Overview

Modeling is the process of incorporating information into a tool which can forecast and make predictions. Usually, we are dealing with statistical modeling, where we want to analyze relationships between variables. Formally, we want to estimate a function f(X) such that:

Y = f(X) + ε

where X = (X1, X2, ...Xp) represents the input variables, Y represents the output variable, and ε represents random error.

Statistical learning is a set of approaches for estimating this f(X).

Why Estimate f(X)?
Prediction: once we have a good estimate f̂(X), we can use it to make predictions on new data. We treat f̂ as a black box, since we only care about the accuracy of the predictions, not why or how it works.
Inference: we want to understand the relationship between X and Y. We can no longer treat f̂ as a black box since we want to understand how Y changes with respect to X = (X1, X2, ...Xp).

More About ε
The error term ε is composed of the reducible and irreducible error, which will prevent us from ever obtaining a perfect estimate f̂.
• Reducible: error that can potentially be reduced by using the most appropriate statistical learning technique to estimate f. The goal is to minimize the reducible error.
• Irreducible: error that cannot be reduced no matter how well we estimate f. Irreducible error is unknown and unmeasurable and will always be an upper bound for ε.

Note: There will always be trade-offs between model flexibility (prediction) and model interpretability (inference). This is just another case of the bias-variance trade-off. Typically, as flexibility increases, interpretability decreases. Much of statistical learning/modeling is finding a way to balance the two.

Modeling: Philosophies

Modeling is the process of incorporating information into a tool which can forecast and make predictions. Designing and validating models is important, as is evaluating their performance. Note that the best forecasting model may not be the most accurate one.

Philosophies of Modeling
Occam's Razor: philosophical principle that the simplest explanation is the best explanation. In modeling, if we are given two models that predict equally well, we should choose the simpler one. Choosing the more complex one can often result in overfitting.
Bias-Variance Trade-Off: inherent part of predictive modeling, where models with lower bias will have higher variance and vice versa. The goal is to achieve low bias and low variance.
• Bias: error from incorrect assumptions made to make the target function easier to learn (high bias → missing relevant relations, or underfitting)
• Variance: error from sensitivity to fluctuations in the dataset, or how much the target estimate would differ if different training data were used (high variance → modeling noise, or overfitting)
No Free Lunch Theorem: no single machine learning algorithm is better than all the others on all problems. It is common to try multiple models and find one that works best for a particular problem.

Thinking Like Nate Silver
1. Think Probabilistically: probabilistic forecasts are more meaningful than concrete statements and should be reported as probability distributions (including σ along with the mean prediction µ).
2. Incorporate New Information: use live models, which continually update using new information. To update, use Bayesian reasoning to calculate how probabilities change in response to new evidence.
3. Look For Consensus Forecast: use multiple distinct sources of evidence. Some models operate this way, such as boosting and bagging, which use large numbers of weak classifiers to produce a strong one.
Modeling: Taxonomy

There are many different types of models. It is important to understand the trade-offs and when to use a certain type of model.

Parametric vs. Nonparametric
• Parametric: models that first make an assumption about the functional form, or shape, of f (e.g. linear), then fit the model. This reduces estimating f to just estimating a set of parameters, but if our assumption was wrong, it will lead to bad results.
• Non-Parametric: models that don't make any assumptions about f, which allows them to fit a wider range of shapes, but may lead to overfitting.

Supervised vs. Unsupervised
• Supervised: models that fit input variables x_i = (x_1, x_2, ...x_n) to a known output variable y_i = (y_1, y_2, ...y_n).
• Unsupervised: models that take in input variables x_i = (x_1, x_2, ...x_n) but do not have an associated output to supervise the training. The goal is to understand relationships between the variables or observations.

Blackbox vs. Descriptive
• Blackbox: models that make decisions, but we do not know what happens "under the hood", e.g. deep learning, neural networks.
• Descriptive: models that provide insight into why they make their decisions, e.g. linear regression, decision trees.

First-Principle vs. Data-Driven
• First-Principle: models based on a prior belief of how the system under investigation works; incorporates domain knowledge (ad-hoc).
• Data-Driven: models based on observed correlations between input and output variables.

Deterministic vs. Stochastic
• Deterministic: models that produce a single "prediction", e.g. yes or no, true or false.
• Stochastic: models that produce probability distributions over possible events.

Flat vs. Hierarchical
• Flat: models that solve problems on a single level, with no notion of subproblems.
• Hierarchical: models that solve several different nested subproblems.

Modeling: Evaluation Metrics

Need to determine how good our model is. The best way to assess models is with out-of-sample predictions (data points your model has never seen).

Classification

              Predicted Yes           Predicted No
Actual Yes    True Positives (TP)     False Negatives (FN)
Actual No     False Positives (FP)    True Negatives (TN)

Accuracy: ratio of correct predictions over total predictions. Misleading when class sizes are substantially different. accuracy = (TP + TN) / (TP + TN + FN + FP)
Precision: how often the classifier is correct when it predicts positive: precision = TP / (TP + FP)
Recall: how often the classifier is correct for all positive instances: recall = TP / (TP + FN)
F-Score: single measurement to describe performance: F = 2 · (precision · recall) / (precision + recall)
ROC Curves: plot true positive rates against false positive rates for various thresholds at which the model decides whether a data point is positive or negative (e.g. if > 0.8, classify as positive). The best possible area under the ROC curve (AUC) is 1, while random is 0.5, the main diagonal line.

Regression

Errors are defined as the difference between a prediction y0 and the actual result y.
Absolute Error: Δ = y0 − y
Squared Error: Δ² = (y0 − y)²
Mean Squared Error: MSE = (1/n) Σ_{i=1..n} (y0_i − y_i)²
Root Mean Squared Error: RMSD = √MSE
Absolute Error Distribution: plot the absolute error distribution; it should be symmetric, centered around 0, bell-shaped, and contain rare extreme outliers.
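The classification metrics above are one-liners once the confusion-matrix counts are known; a small sketch (the counts are made up):

def classification_metrics(tp, fp, fn, tn):
    # Metrics derived from the confusion matrix as defined above.
    accuracy  = (tp + tn) / (tp + tn + fn + fp)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f_score   = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Example: 80 true positives, 10 false positives, 20 false negatives, 90 true negatives.
print(classification_metrics(tp=80, fp=10, fn=20, tn=90))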
Modeling: Evaluation Environment

Evaluation metrics provide us with the tools to estimate errors, but what should be the process to obtain the best estimate? Resampling involves repeatedly drawing samples from a training set and refitting a model to each sample, which provides us with additional information compared to fitting the model once, such as obtaining a better estimate for the test error.

Key Concepts
Training Data: data used to fit your models, or the set used for learning.
Validation Data: data used to tune the parameters of a model.
Test Data: data used to evaluate how good your model is. Ideally your model should never touch this data until final testing/evaluation.

Cross Validation
Class of methods that estimate test error by holding out a subset of training data from the fitting process.
Validation Set: split data into a training set and a validation set. Train the model on the training set and estimate the test error using the validation set, e.g. an 80-20 split.
Leave-One-Out CV (LOOCV): split data into a training set and a validation set, but the validation set consists of 1 observation. Repeat until every observation has been used as validation. The test error is the average of these n test error estimates.
k-Fold CV: randomly divide the data into k groups (folds) of approximately equal size. The first fold is used as validation and the rest as training. Then repeat k times and take the average of the k estimates.

Bootstrapping
Methods that rely on random sampling with replacement. Bootstrapping helps with quantifying the uncertainty associated with a given estimate or model.

Amplifying Small Data Sets
What can we do if we don't have enough data?
• Create Negative Examples: e.g. when classifying presidential candidates, most people would be unqualified, so label most as unqualified.
• Synthetic Data: create additional data by adding noise to the real data.
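A small NumPy-only sketch of both resampling ideas: a manual k-fold split and a bootstrap estimate of the uncertainty of a sample mean (the data, fold count, and resample count are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=100)

# k-fold CV: rotate which fold is held out as the validation set
k = 5
folds = np.array_split(rng.permutation(len(x)), k)
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # ...fit a model on x[train_idx], evaluate on x[val_idx], average the k errors...

# Bootstrap: resample with replacement to quantify the uncertainty of the sample mean
boot_means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(2000)]
print(np.mean(boot_means), np.std(boot_means))   # estimate and its standard error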
Linear Regression

Linear regression is a simple and useful tool for predicting a quantitative response. The relationship between input variables X = (X1, X2, ...Xp) and output variable Y takes the form:

Y ≈ β0 + β1·X1 + ... + βp·Xp + ε

β0...βp are the unknown coefficients (parameters) which we are trying to determine. The best coefficients will lead us to the best "fit", which can be found by minimizing the residual sum of squares (RSS), the sum of the squared differences between the actual ith value and the predicted ith value: RSS = Σ_{i=1..n} e_i², where e_i = y_i − ŷ_i.

How to find the best fit (i.e. the best coefficients)?

Matrix Form: we can solve the closed-form equation for the coefficient vector w: w = (XᵀX)⁻¹ Xᵀ Y, where X represents the input data and Y represents the output data. This method is used for smaller matrices, since inverting a matrix is computationally expensive.

Gradient Descent: first-order optimization algorithm. We can find the minimum of a convex function by starting at an arbitrary point and repeatedly taking steps in the downward direction, which can be found by taking the negative direction of the gradient. After several iterations, we will eventually converge to the minimum. In our case, the minimum corresponds to the coefficients with the minimum error, or the best line of fit. The learning rate α determines the size of the steps we take in the downward direction.

Gradient descent algorithm in two dimensions. Repeat until convergence:
1. w0^(t+1) := w0^t − α ∂J(w0, w1)/∂w0
2. w1^(t+1) := w1^t − α ∂J(w0, w1)/∂w1

For non-convex functions, gradient descent no longer guarantees an optimal solution since there may be local minima. Instead, we should run the algorithm from different starting points and use the best local minimum we find for the solution.

Stochastic Gradient Descent: instead of taking a step after processing the entire training set, we take a small batch of training data at random to determine our next step. Computationally more efficient and may lead to faster convergence.
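A minimal NumPy sketch of both fitting approaches described above on made-up one-feature data: the closed-form normal equation and batch gradient descent on the squared-error cost (the learning rate and iteration count are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=200)   # true intercept 3, slope 2

X = np.column_stack([np.ones_like(x), x])             # design matrix with a bias column

# Matrix form: w = (X^T X)^-1 X^T y, solved without an explicit inverse
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on J(w) = (1/2n) * sum((Xw - y)^2)
w = np.zeros(2)
alpha = 0.01                                          # learning rate
for _ in range(5000):
    grad = X.T @ (X @ w - y) / len(y)                 # gradient of J at the current w
    w -= alpha * grad

print(w_closed, w)                                    # both should be near [3, 2]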


Linear Regression II

Improving Linear Regression
Subset/Feature Selection: identify a subset of the p predictors that we believe to be best related to the response, then fit the model using the reduced set of variables.
• Best, Forward, and Backward Subset Selection
Shrinkage/Regularization: all variables are used, but the estimated coefficients are shrunken towards zero relative to the least squares estimate. λ represents the tuning parameter: as λ increases, flexibility decreases → decreased variance but increased bias. The tuning parameter is key in determining the sweet spot between under- and over-fitting. In addition, while Ridge will always produce a model with p variables, Lasso can force coefficients to be equal to zero.
• Lasso (L1): min RSS + λ Σ_{j=1..p} |β_j|
• Ridge (L2): min RSS + λ Σ_{j=1..p} β_j²
Dimension Reduction: project the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations of the variables. Can use PCA.
Miscellaneous: removing outliers, feature scaling, removing multicollinearity (correlated variables).

Evaluating Model Accuracy
Residual Standard Error (RSE): RSE = sqrt( RSS / (n − 2) ). Generally, the smaller the better.
R²: measure of fit that represents the proportion of variance explained, or the variability in Y that can be explained using X. It takes on a value between 0 and 1; generally the higher the better. R² = 1 − RSS/TSS, where the Total Sum of Squares (TSS) = Σ (y_i − ȳ)².

Evaluating Coefficient Estimates
The Standard Error (SE) of the coefficients can be used to perform hypothesis tests on the coefficients: H0: no relationship between X and Y; Ha: some relationship exists. A p-value can be obtained and interpreted as follows: a small p-value indicates that a relationship between the predictor (X) and the response (Y) exists. Typical p-value cutoffs are around 5% or 1%.

Logistic Regression

Logistic regression is used for classification, where the response variable is categorical rather than numerical.

The model works by predicting the probability that Y belongs to a particular category by first fitting the data to a linear regression model, which is then passed to the logistic function (below). The logistic function will always produce an S-shaped curve, so regardless of X, we can always obtain a sensible answer (between 0 and 1). If the probability is above a certain predetermined threshold (e.g. P(Yes) > 0.5), then the model will predict Yes.

p(X) = e^(β0 + β1·X1 + ... + βp·Xp) / (1 + e^(β0 + β1·X1 + ... + βp·Xp))

Maximum Likelihood: the coefficients β0...βp are unknown and must be estimated from the training data. We seek estimates for β0...βp such that the predicted probability p̂(x_i) of each observation is a number close to one if it is observed in a certain class and close to zero otherwise. This is done by maximizing the likelihood function:

l(β0, β1) = Π_{i: y_i = 1} p(x_i) · Π_{i′: y_i′ = 0} (1 − p(x_i′))

Potential Issues
Imbalanced Classes: an imbalance in the classes of the training data leads to poor classifiers. It can result in a lot of false positives and also leaves little training data. Solutions include forcing balanced data by removing observations from the larger class, replicating data from the smaller class, or heavily weighting the training examples toward instances of the larger class.
Multi-Class Classification: the more classes you try to predict, the harder it will be for the classifier to be effective. It is possible with logistic regression, but another approach, such as Linear Discriminant Analysis (LDA), may prove better.
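A small NumPy sketch of the prediction side of logistic regression: apply the logistic function to a linear score and threshold it. The coefficients here are made up, as if they had already been estimated by maximum likelihood:

import numpy as np

def logistic(z):
    # maps any real number into (0, 1), producing the S-shaped curve
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([-3.0, 1.2])            # [beta_0, beta_1], assumed already fitted
X = np.array([[1.0], [2.0], [3.0], [4.0]])

scores = beta[0] + X @ beta[1:]         # linear part: beta_0 + beta_1 * X_1
probs = logistic(scores)                # p(X) in (0, 1)
preds = (probs > 0.5).astype(int)       # predict "Yes" above the 0.5 threshold
print(probs, preds)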
Distance/Network Methods

Interpreting examples as points in space provides a way to find natural groupings or clusters among data, e.g. which stars are the closest to our sun? Networks can also be built from point sets (vertices) by connecting related points.

Measuring Distances/Similarity
There are several ways of measuring distances between points a and b in d dimensions, with closer distances implying similarity.

Minkowski Distance Metric: d_k(a,b) = ( Σ_{i=1..d} |a_i − b_i|^k )^(1/k)
The parameter k provides a way to trade off between the largest and the total dimensional difference. In other words, larger values of k place more emphasis on large differences between feature values than smaller values. Selecting the right k can significantly impact the meaningfulness of your distance function. The most popular values are 1 and 2.
• Manhattan (k=1): city block distance, or the sum of the absolute differences between two points
• Euclidean (k=2): straight line distance

Weighted Minkowski: d_k(a,b) = ( Σ_{i=1..d} w_i |a_i − b_i|^k )^(1/k); in some scenarios not all dimensions are equal, and we can convey this idea using w_i. Generally not a good idea; you should normalize the data by z-scores before computing distances.

Cosine Similarity: cos(a,b) = a·b / (|a||b|), calculates the similarity between two non-zero vectors, where a·b is the dot product (normalized between 0 and 1); higher values imply more similar vectors.

Kullback-Leibler Divergence: KL(A‖B) = Σ_{i=1..d} a_i log2(a_i / b_i). KL divergence measures the distance between probability distributions by measuring the uncertainty gained or lost when replacing distribution A with distribution B. However, it is not a metric, but it forms the basis for the Jensen-Shannon Divergence Metric.

Jensen-Shannon: JS(A,B) = ½ KL(A‖M) + ½ KL(B‖M), where M is the average of A and B. The JS function is the right metric for calculating distances between probability distributions.
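The distance formulas above translate directly into a few NumPy functions; a small sketch (the vectors and distributions are made up, and the KL/JS inputs are assumed to be strictly positive probability vectors):

import numpy as np

def minkowski(a, b, k=2, w=None):
    # k=1 gives Manhattan, k=2 gives Euclidean; w optionally weights each dimension
    w = np.ones_like(a, dtype=float) if w is None else w
    return np.sum(w * np.abs(a - b) ** k) ** (1.0 / k)

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def kl(a, b):
    # KL(A || B) for strictly positive probability vectors a, b
    return np.sum(a * np.log2(a / b))

def js(a, b):
    m = (a + b) / 2.0
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 1.0])
p, q = np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.4, 0.2])
print(minkowski(a, b, k=1), minkowski(a, b, k=2), cosine_similarity(a, b), kl(p, q), js(p, q))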
Nearest Neighbor Classification

Distance functions allow us to identify the points closest to a given target, or the nearest neighbors (NN) to a given point. The advantages of NN include simplicity, interpretability and non-linearity.

k-Nearest Neighbors
Given a positive integer k and a point x0, the KNN classifier first identifies the k points in the training data most similar to x0, then estimates the conditional probability of x0 being in class j as the fraction of the k points whose values belong to j. The optimal value for k can be found using cross validation.

KNN Algorithm
1. Compute the distance D(a,b) from point b to all points
2. Select the k closest points and their labels
3. Output the class with the most frequent label among the k points

Optimizing KNN
Comparing a query point a in d dimensions against n training examples has a runtime of O(nd), which can cause lag as points reach millions or billions. Popular choices to speed up KNN include:
• Voronoi Diagrams: partition the plane into regions based on distance to points in a specific subset of the plane
• Grid Indexes: carve up space into d-dimensional boxes or grids and calculate the NN in the same cell as the point
• Locality Sensitive Hashing (LSH): abandons the idea of finding the exact nearest neighbors. Instead, batch up nearby points to quickly find the most appropriate bucket B for our query point. LSH is defined by a hash function h(p) that takes a point/vector as input and produces a number/code as output, such that it is likely that h(a) = h(b) if a and b are close to each other, and h(a) != h(b) if they are far apart.
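A direct NumPy implementation of the three-step KNN algorithm above (the tiny training set and the choice k=3 are just for illustration):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k=3):
    # 1. Compute the distance from x0 to all training points (Euclidean)
    dists = np.linalg.norm(X_train - x0, axis=1)
    # 2. Select the k closest points and their labels
    nearest = np.argsort(dists)[:k]
    # 3. Output the most frequent label among the k points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [0.9, 1.1]])
y_train = np.array(['A', 'A', 'B', 'B', 'A'])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))   # 'A'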
Clustering

Clustering is the problem of grouping points by similarity using distance metrics, which ideally reflect the similarities you are looking for. Often items come from logical "sources" and clustering is a good way to reveal those origins. Perhaps the first thing to do with any data set. Possible applications include: hypothesis development, modeling over smaller subsets of data, data reduction, outlier detection.

K-Means Clustering
Simple and elegant algorithm to partition a dataset into K distinct, non-overlapping clusters.
1. Choose a K. Randomly assign a number between 1 and K to each observation. These serve as initial cluster assignments.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using the distance metric).
Since the results of the algorithm depend on the initial random assignments, it is a good idea to repeat the algorithm from different random initializations to obtain the best overall results. Can use MSE to determine which cluster assignment is better.

Hierarchical Clustering
Alternative clustering algorithm that does not require us to commit to a particular K. Another advantage is that it results in a nice visualization called a dendrogram. Observations that fuse at the bottom are similar, whereas those at the top are quite different; we draw conclusions based on the location on the vertical rather than the horizontal axis.
1. Begin with n observations and a measure of all the n(n−1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
2. For i = n, n−1, ..., 2:
(a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
(b) Compute the new pairwise inter-cluster dissimilarities among the remaining i − 1 clusters.
Linkage: Complete (max dissimilarity), Single (min), Average, Centroid (between centroids of clusters A and B)
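A compact NumPy sketch of the K-means loop described above, using random initial assignments and alternating centroid updates with reassignment (the data and K are arbitrary; a real run should be repeated from several initializations, and empty-cluster handling is omitted for brevity):

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))          # step 1: random initial assignments
    for _ in range(n_iter):
        # (a) compute each cluster centroid (mean of its points)
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # (b) reassign each observation to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # stop when assignments no longer change
            break
        labels = new_labels
    return labels, centroids

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
               np.random.default_rng(2).normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, K=2)
print(centroids)                                      # roughly (0, 0) and (5, 5)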
Machine Learning Part I

Comparing ML Algorithms
Power and Expressibility: ML methods differ in terms of complexity. Linear regression fits linear functions, while NNs define piecewise-linear separation boundaries. More complex models can provide more accurate models, but at the risk of overfitting.
Interpretability: some models are more transparent and understandable than others (white box vs. black box models).
Ease of Use: some models feature few parameters/decisions (linear regression/NN), while others require more decision making to optimize (SVMs).
Training Speed: models differ in how fast they fit the necessary parameters.
Prediction Speed: models differ in how fast they make predictions given a query.

Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features.

Problem: suppose we need to classify vector X = x1...xn into m classes, C1...Cm. We need to compute the probability of each possible class given X, so we can assign X the label of the class with the highest probability. We can calculate a probability using Bayes' Theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Where:
1. P(Ci): the prior probability of belonging to class i
2. P(X): normalizing constant, or the probability of seeing the given input vector over all possible input vectors
3. P(X|Ci): the conditional probability of seeing input vector X given we know the class is Ci

The prediction model will formally look like:

C(X) = argmax_{i ∈ classes(t)} P(X|Ci) P(Ci) / P(X)

where C(X) is the prediction returned for input X.

Machine Learning Part II

Decision Trees
Binary branching structure used to classify an arbitrary input vector X. Each node in the tree contains a simple feature comparison against some field (e.g. xi > 42?). The result of each comparison is either true or false, which determines whether we should proceed along to the left or right child of the given node. Also known as classification and regression trees (CART).
Advantages: non-linearity, support for categorical variables, easy to interpret, applicable to regression.
Disadvantages: prone to overfitting, unstable (not robust to noise), high variance, low bias.
Note: rarely do models just use one decision tree. Instead, we aggregate many decision trees using methods like ensembling, bagging, and boosting.

Ensembles, Bagging, Random Forests, Boosting
Ensemble learning is the strategy of combining many different classifiers/models into one predictive model. It revolves around the idea of voting: a so-called "wisdom of crowds" approach. The most predicted class will be the final prediction.
Bagging: ensemble method that works by taking B bootstrapped subsamples of the training data and constructing B trees, each tree training on a distinct subsample.
Random Forests: builds on bagging by decorrelating the trees. We do everything the same as in bagging, but when we build the trees, every time we consider a split, a random sample of the p predictors is chosen as split candidates, not the full set (typically m ≈ √p). When m = p, we are just doing bagging.
Boosting: the main idea is to improve our model where it is not performing well by using information from previously constructed classifiers. A slow learner. Has 3 tuning parameters: the number of classifiers B, the learning parameter λ, and the interaction depth d (controls the interaction order of the model).

Machine Learning Part III

Support Vector Machines
Work by constructing a hyperplane that separates points between two classes. The hyperplane is determined using the maximal margin hyperplane, which is the hyperplane that is the maximum distance from the training observations. This distance is called the margin. Points that fall on one side of the hyperplane are classified as -1 and the other +1.

Principal Component Analysis (PCA)
Principal components allow us to summarize a set of correlated variables with a smaller set of variables that collectively explain most of the variability in the original set. Essentially, we are "dropping" the least important feature variables.

Principal Component Analysis is the process by which principal components are calculated and then used to analyze and understand the data. PCA is an unsupervised approach and is used for dimensionality reduction, feature extraction, and data visualization. Variables after performing PCA are independent. Scaling variables is also important while performing PCA.
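A minimal NumPy sketch of PCA on made-up data: center and scale the variables, take the SVD, and keep the top components. This is one standard way to compute principal components, not the only one:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)     # make the third variable nearly redundant

Xc = (X - X.mean(axis=0)) / X.std(axis=0)          # center and scale each variable
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt are the principal directions

explained = S**2 / np.sum(S**2)                    # fraction of variance per component
scores = Xc @ Vt[:2].T                             # project onto the top 2 components
print(explained, scores.shape)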


Machine Learning Part IV

ML Terminology and Concepts
Features: input data/variables used by the ML model.
Feature Engineering: transforming input features to be more useful for the models, e.g. mapping categories to buckets, normalizing between -1 and 1, removing nulls.
Train/Eval/Test: training data is used to optimize the model, evaluation data is used to assess the model on new data during training, test data is used to provide the final result.
Classification/Regression: regression is predicting a number (e.g. housing price), classification is predicting from a set of categories (e.g. predicting red/blue/green).
Linear Regression: predicts an output by multiplying and summing input features with weights and biases.
Logistic Regression: similar to linear regression but predicts a probability.
Overfitting: model performs great on the input data but poorly on the test data (combat with dropout, early stopping, or reducing the number of nodes or layers).
Bias/Variance: how much the output is determined by the features. More variance often can mean overfitting; more bias can mean a bad model.
Regularization: variety of approaches to reduce overfitting, including adding the weights to the loss function and randomly dropping layers (dropout).
Ensemble Learning: training multiple models with different parameters to solve the same problem.
A/B Testing: statistical way of comparing 2+ techniques to determine which technique performs better and also whether the difference is statistically significant.
Baseline Model: simple model/heuristic used as a reference point for comparing how well a model is performing.
Bias: prejudice or favoritism towards some things, people, or groups over others that can affect the collection/sampling and interpretation of data, the design of a system, and how users interact with a system.
Dynamic Model: model that is trained online in a continuously updating fashion.
Static Model: model that is trained offline.
Normalization: process of converting an actual range of values into a standard range of values, typically -1 to +1.
Independently and Identically Distributed (i.i.d.): data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on previously drawn values; ideal but rarely found in real life.
Hyperparameters: the "knobs" that you tweak during successive runs of training a model.
Generalization: refers to a model's ability to make correct predictions on new, previously unseen data as opposed to the data used to train the model.
Cross-Entropy: quantifies the difference between two probability distributions.

Deep Learning Part I

What is Deep Learning?
Deep learning is a subset of machine learning. One popular DL technique is based on Neural Networks (NN), which loosely mimic the human brain; the code structures are arranged in layers. Each layer's input is the previous layer's output, which yields progressively higher-level features and defines a hierarchy. A Deep Neural Network is just a NN that has more than 1 hidden layer.

Recall that statistical learning is all about approximating f(X). Neural networks are known as universal approximators, meaning that no matter how complex a function is, there exists a NN that can (approximately) do the job. We can increase the approximation (or complexity) by adding more hidden layers and neurons.

Popular Architectures
There are different kinds of NNs that are suitable for certain problems, depending on the NN's architecture.
Linear Classifier: takes input features and combines them with weights and biases to predict an output value.
DNN: deep neural net; contains intermediate layers of nodes that represent "hidden features" and activation functions to represent non-linearity.
CNN: convolutional NN; has a combination of convolutional, pooling, and dense layers. Popular for image classification.
Transfer Learning: use existing trained models as starting points and add additional layers for the specific use case. The idea is that highly trained existing models know general features that serve as a good starting point for training a small network on specific examples.
RNN: recurrent NN; designed for handling a sequence of inputs that have "memory" of the sequence. LSTMs are a fancy version of RNNs, popular for NLP.
GAN: generative adversarial NN; one model creates fake examples, and another model is served both fake examples and real examples and is asked to distinguish between them.
Wide and Deep: combines linear classifiers with deep neural net classifiers; "wide" linear parts represent memorizing specific examples and "deep" parts represent understanding high-level features.
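A tiny NumPy sketch of the idea behind these architectures: a network with one hidden layer is just matrix multiplications with a nonlinear activation in between. The weights here are random (i.e. untrained) and the layer sizes are arbitrary:

import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=(4,))                        # one input example with 4 features

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)    # input -> hidden layer (8 neurons)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)    # hidden -> output layer

h = relu(W1 @ x + b1)                            # hidden features: weighted sum + activation
y = sigmoid(W2 @ h + b2)                         # output squashed to a probability
print(y)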
Bias: prejudice or favoritism towards some things, people, DNN: deep neural net, contains intermediate layers of mension), data type (data type of each element in tensor).
or groups over others that can affect collection/sampling nodes that represent “hidden features” and activation
and interpretation of data, the design of a system, and functions to represent non-linearity Placeholders and Variables
how users interact with a system CNN: convolutional NN, has a combination of convolu- Variables: best way to represent shared, persistent state
Dynamic Model: model that is trained online in a con- tional, pooling, dense layers. popular for image classifica- manipulated by your program. These are the parameters
tinuously updating fashion tion. of the ML model are altered/trained during the training
Static Model: model that is trained offline Transfer Learning: use existing trained models as start- process. Training variables.
Normalization: process of converting an actual range of ing points and add additional layers for the specific use Placeholders: way to specify inputs into a graph that
values into a standard range of values, typically -1 to +1 case. idea is that highly trained existing models know hold the place for a Tensor that will be fed at runtime.
Independently and Identically Distributed (i.i.d): general features that serve as a good starting point for They are assigned once, do not change after. Input nodes
data drawn from a distribution that doesn’t change, and training a small network on specific examples
where each value drawn doesn’t depend on previously RNN: recurrent NN, designed for handling a sequence of
drawn values; ideal but rarely found in real life inputs that have ”memory” of the sequence. LSTMs are
Hyperparameters: the ”knobs” that you tweak during a fancy version of RNNs, popular for NLP
successive runs of training a model GAN: general adversarial NN, one model creates fake ex-
Generalization: refers to a model’s ability to make cor- amples, and another model is served both fake example
rect predictions on new, previously unseen data as op- and real examples and is asked to distinguish
posed to the data used to train the model Wide and Deep: combines linear classifiers with deep
Cross-Entropy: quantifies the difference between two neural net classifiers, ”wide” linear parts represent mem-
probability distributions orizing specific examples and “deep” parts represent un-
derstanding high level features
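A minimal sketch of the placeholder/variable distinction described above, assuming the TensorFlow 1.x graph API (in TensorFlow 2.x these calls live under tf.compat.v1):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 3])       # input node, fed at runtime
W = tf.Variable(tf.zeros([3, 1]))                      # trainable, persistent parameter
b = tf.Variable(tf.zeros([1]))                         # trainable, persistent parameter
y = tf.matmul(x, W) + b                                # op producing an output tensor

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())        # variables must be initialized
    print(sess.run(y, feed_dict={x: [[1., 2., 3.]]}))  # the placeholder is fed here

The placeholder only reserves a typed slot in the graph; the variables hold the persistent state that training would update.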
Deep Learning Part III Big Data- Hadoop Overview Big Data- Hadoop Ecosystem
Deep Learning Terminology and Concepts Data can no longer fit in memory on one machine An entire ecosystem of tools has emerged around
(monolithic), so a new way of computing was devised Hadoop, which are based on interacting with HDFS.
Neuron: node in a NN, typically taking in multiple in- using a group of computers to process this ”big data” Below are some popular ones:
put values and generating one output value, calculates the (distributed). Such a group is called a cluster, which
output value by applying an activation function (nonlin- makes up server farms. All of these servers have to be Hive: data warehouse software built on top of Hadoop that
ear transformation) to a weighted sum of input values coordinated in the following ways: partition data, coor- facilitates reading, writing, and managing large datasets
Weights: edges in a NN, the goal of training is to deter- dinate computing tasks, handle fault tolerance/recovery, residing in distributed storage using SQL-like queries
mine the optimal weight for each feature; if weight = 0, and allocate capacity to process. (HiveQL). Hive abstracts away underlying MapReduce
corresponding feature does not contribute jobs and returns query results in the form of tables (rather than raw HDFS files).
Neural Network: composed of neurons (simple building Hadoop Pig: high level scripting language (Pig Latin) that
blocks that actually “learn”), contains activation functions Hadoop is an open source distributed processing frame- enables writing complex data transformations. It pulls
that makes it possible to predict non-linear outputs work that manages data processing and storage for big unstructured/incomplete data from sources, cleans it, and
Activation Functions: mathematical functions that in- data applications running in clustered systems. It is com- places it in a database/data warehouses. Pig performs
troduce non-linearity to a network e.g. RELU, tanh prised of 3 main components: ETL into data warehouse while Hive queries from data
Sigmoid Function: function that maps very negative • Hadoop Distributed File System (HDFS): warehouse to perform analysis (GCP: DataFlow).
numbers to a number very close to 0, huge numbers close a distributed file system that provides high- Spark: framework for writing fast, distributed programs
to 1, and 0 to .5. Useful for predicting probabilities throughput access to application data by partition- for data processing and analysis. Spark solves similar
Gradient Descent/Backpropagation: fundamental ing data across many machines problems as Hadoop MapReduce but with a fast in-
loss optimizer algorithms, on which the other optimizers • YARN: framework for job scheduling and cluster memory approach. It is a unified engine that supports
are usually based. Backpropagation is similar to gradient resource management (task coordination) SQL queries, streaming data, machine learning and
descent but for neural nets • MapReduce: YARN-based system for parallel graph processing. Can operate separately from Hadoop
Optimizer: operation that changes the weights and bi- processing of large data sets on multiple machines but integrates well with Hadoop. Data is processed
ases to reduce loss e.g. Adagrad or Adam using Resilient Distributed Datasets (RDDs), which are
Weights / Biases: weights are values that the input fea- HDFS immutable, lazily evaluated, and tracks lineage. ;
tures are multiplied by to predict an output value. Biases Data is spread across disks on many machines; a cluster is comprised Hbase: non-relational, NoSQL, column-oriented
are the value of the output given a weight of 0. of 1 master node and the rest are workers/data nodes. database management system that runs on top of
Converge: algorithm that converges will eventually reach The master node manages the overall file system by HDFS. Well suited for sparse data sets (GCP: BigTable)
an optimal answer, even if very slowly. An algorithm that storing the directory structure and the metadata of the Flink/Kafka: stream processing framework. Batch
doesn’t converge may never reach an optimal answer. files. The data nodes physically store the data. Large streaming is for bounded, finite datasets, with periodic
Learning Rate: rate at which optimizers change weights files are broken up and distributed across multiple ma- updates, and delayed processing. Stream processing
and biases. High learning rate generally trains faster but chines, which are also replicated across multiple machines is for unbounded datasets, with continuous updates,
risks not converging, whereas a lower rate trains slower to provide fault tolerance. and immediate processing. Stream data and stream
Numerical Instability: issues with very large/small val- processing must be decoupled via a message queue.
ues due to limits of floating point numbers in computers MapReduce Can group streaming data (windows) using tumbling
Embeddings: mapping from discrete objects, such as Parallel programming paradigm which allows for process- (non-overlapping time), sliding (overlapping time), or
words, to vectors of real numbers. useful because classi- ing of huge amounts of data by running processes on mul- session (session gap) windows.
fiers/neural networks work well on vectors of real numbers tiple machines. Defining a MapReduce job requires two Beam: programming model to define and execute data
Convolutional Layer: series of convolutional opera- stages: map and reduce. processing pipelines, including ETL, batch and stream
tions, each acting on a different slice of the input matrix • Map: operation to be performed in parallel on small (continuous) processing. After building the pipeline,
Dropout: method for regularization in training NNs, portions of the dataset. the output is a key-value it is executed by one of Beam’s distributed processing
works by removing a random selection of some units in pair < K, V > back-ends (Apache Apex, Apache Flink, Apache Spark,
a network layer for a single gradient step • Reduce: operation to combine the results of Map and Google Cloud Dataflow). Modeled as a Directed
Early Stopping: method for regularization that involves Acyclic Graph (DAG).
ending model training early YARN- Yet Another Resource Negotiator Oozie: workflow scheduler system to manage Hadoop
Gradient Descent: technique to minimize loss by com- Coordinates tasks running on the cluster and assigns new jobs
puting the gradients of loss with respect to the model’s nodes in case of failure. Comprised of 2 subcomponents: Sqoop: transferring framework to transfer large amounts
parameters, conditioned on training data the resource manager and the node manager. The re- of data into HDFS from relational databases (MySQL)
Pooling: Reducing a matrix (or matrices) created by an source manager runs on a single master node and sched-
earlier convolutional layer to a smaller matrix. Pooling ules tasks across nodes. The node manager runs on all
usually involves taking either the maximum or average other nodes and manages tasks on the individual node.
value across the pooled area
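The map and reduce stages described above can be sketched in plain Python over an in-memory list; this is only a conceptual illustration, not an actual Hadoop job:

from itertools import groupby
from operator import itemgetter

docs = ["big data big cluster", "data nodes store data"]

# Map: emit <word, 1> key-value pairs, record by record
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle/sort: bring equal keys together
mapped.sort(key=itemgetter(0))

# Reduce: combine the values for each key
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'big': 2, 'cluster': 1, 'data': 3, 'nodes': 1, 'store': 1}

In Hadoop the map calls run on the data nodes that hold each block, and the framework performs the shuffle before the reduce tasks start.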
SQL Part I Python- Data Structures Recommended Resources
Structured Query Language (SQL) is a declarative Data structures are a way of storing and manipulating • Data Science Design Manual
language used to access & manipulate data in databases. data and each data structure has its own strengths and (www.springer.com/us/book/9783319554433)
Usually the database is a Relational Database Man- weaknesses. Combined with algorithms, data structures • Introduction to Statistical Learning
agement System (RDBMS), which stores data arranged allow us to efficiently solve problems. It is important to (www-bcf.usc.edu/~gareth/ISL/)
in relational database tables. A table is arranged in know the main types of data structures that you will need • Probability Cheatsheet
columns and rows, where columns represent character- to efficiently solve problems. (/www.wzchen.com/probability-cheatsheet/)
istics of stored data and rows represent actual data entries. • Google’s Machine Learning Crash Course
Lists: or arrays, ordered sequences of objects, mutable (developers.google.com/machine-learning/
Basic Queries crash-course/)
>>> l = [42, 3.14, "hello","world"]
- filter columns: SELECT col1, col3... FROM table1
- filter the rows: WHERE col4 = 1 AND col5 = 2 Tuples: like lists, but immutable
- aggregate the data: GROUP BY. . .
- limit aggregated data: HAVING count(*) > 1 >>> t = (42, 3.14, "hello","world")
- order of the results: ORDER BY col2 Dictionaries: hash tables, key-value pairs, unsorted

Useful Keywords for SELECT >>> d = {"life": 42, "pi": 3.14}


DISTINCT- return unique results Sets: mutable, unordered sequence of unique elements.
BETWEEN a AND b- limit the range, the values can frozensets are just immutable sets
be numbers, text, or dates
LIKE- pattern search within the column text >>> s = set([42, 3.14, "hello","world"])
IN (a, b, c) - check if the value is contained among given Collections Module
deque: double-ended queue, generalization of stacks and
Data Modification queues; supports append, appendLeft, pop, rotate, etc
- update specific data with the WHERE clause:
UPDATE table1 SET col1 = 1 WHERE col2 = 2 >>> s = deque([42, 3.14, "hello","world"])
- insert values manually
Counter: dict subclass, unordered collection where ele-
INSERT INTO table1 (col1,col3) VALUES (val1,val3);
ments are stored as keys and counts stored as values
- by using the results of a query
INSERT INTO table1 (col1,col3) SELECT col,col2 >>> c = Counter('apple')
FROM table2; >>> print(c)
Counter({'p': 2, 'a': 1, 'l': 1, 'e': 1})
Joins
heapq Module
The JOIN clause is used to combine rows from two or more
Heap Queue: priority queue, heaps are binary trees for
tables, based on a related column between them. which every parent node has a value less than or equal
which every parent node has a value greater than or equal
to any of its children (max-heap), order is important; sup-
ports push, pop, pushpop, heapify, replace functionality
>>> heap = []
>>> for n in data:
... heappush(heap, n)
>>> heap
[0, 1, 3, 6, 2, 8, 4, 7, 9, 5]
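The Joins section above describes the JOIN clause without showing a query; a minimal sketch using Python's built-in sqlite3 module and two hypothetical tables (customers, orders):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 20.0), (12, 2, 5.0);
""")

# JOIN combines rows from both tables via the related column customer_id;
# GROUP BY / HAVING / ORDER BY then work exactly as listed under Basic Queries
rows = con.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    HAVING COUNT(*) > 1
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Ada', 119.5)]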
Python for Data Science Cheat Sheet Spans Syntax iterators
Learn more Python for data science interactively at www.datacamp.com
Accessing spans Sentences USUALLY NEEDS THE DEPENDENCY PARSER

Span indices are exclusive. So doc[2:4] is a span starting at doc = nlp("This is a sentence. This is another one.")
token 2, up to – but not including! – token 4.
About spaCy # doc.sents is a generator that yields sentence spans
[sent.text for sent in doc.sents]
spaCy is a free, open-source library for advanced Natural doc = nlp("This is a text") # ['This is a sentence.', 'This is another one.']
Language Processing (NLP) in Python. It's designed span = doc[2:4]
specifically for production use and helps you build span.text
applications that process and "understand" large volumes # 'a text'
Base noun phrases NEEDS THE TAGGER AND PARSER
of text. Documentation: spacy.io
doc = nlp("I have a red car")
Creating a span manually
# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]
$ pip install spacy # Import the Span object
# ['I', 'a red car']
from spacy.tokens import Span
# Create a Doc object
import spacy
doc = nlp("I live in New York")
# Span for "New York" with label GPE (geopolitical)
span = Span(doc, 3, 5, label="GPE") Label explanations
Statistical models span.text
# 'New York' spacy.explain("RB")
# 'adverb'
Download statistical models spacy.explain("GPE")
Predict part-of-speech tags, dependency labels, named # 'Countries, cities, states'
entities and more. See here for available models: Linguistic features
spacy.io/models
Attributes return label IDs. For string labels, use the
$ python -m spacy download en_core_web_sm attributes with an underscore. For example, token.pos_ . Visualizing
Check that your installed models are up to date If you're in a Jupyter notebook, use displacy.render .
Part-of-speech tags PREDICTED BY STATISTICAL MODEL Otherwise, use displacy.serve to start a web server and
$ python -m spacy validate show the visualization in your browser.
doc = nlp("This is a text.")
Loading statistical models # Coarse-grained part-of-speech tags from spacy import displacy
[token.pos_ for token in doc]
import spacy # ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']
# Load the installed model "en_core_web_sm" # Fine-grained part-of-speech tags Visualize dependencies
nlp = spacy.load("en_core_web_sm") [token.tag_ for token in doc]
# ['DT', 'VBZ', 'DT', 'NN', '.'] doc = nlp("This is a sentence")
displacy.render(doc, style="dep")

Documents and tokens Syntactic dependencies PREDICTED BY STATISTICAL MODEL

doc = nlp("This is a text.")


Processing text # Dependency labels
Processing text with the nlp object returns a Doc object [token.dep_ for token in doc]
that holds all information about the tokens, their linguistic # ['nsubj', 'ROOT', 'det', 'attr', 'punct']
features and their relationships # Syntactic head token (governor)
[token.head.text for token in doc]
# ['is', 'is', 'text', 'is', 'is']
doc = nlp("This is a text")
Visualize named entities
Accessing token attributes Named entities PREDICTED BY STATISTICAL MODEL
doc = nlp("Larry Page founded Google")
doc = nlp("This is a text") doc = nlp("Larry Page founded Google") displacy.render(doc, style="ent")
# Token texts # Text and label of named entity span
[token.text for token in doc] [(ent.text, ent.label_) for ent in doc.ents]
# ['This', 'is', 'a', 'text'] # [('Larry Page', 'PERSON'), ('Google', 'ORG')]
Word vectors and similarity Extension attributes Rule-based matching
To use word vectors, you need to install the larger models Custom attributes that are registered on the global Doc ,
ending in md or lg , for example en_core_web_lg . Token and Span classes and become available as ._ . Token patterns
# "love cats", "loving cats", "loved cats"
Comparing similarity
from spacy.tokens import Doc, Token, Span pattern1 = [{"LEMMA": "love"}, {"LOWER": "cats"}]
doc1 = nlp("I like cats") doc = nlp("The sky over New York is blue") # "10 people", "twenty people"
doc2 = nlp("I like dogs") pattern2 = [{"LIKE_NUM": True}, {"TEXT": "people"}]
# Compare 2 documents # "book", "a cat", "the sea" (noun + optional article)
doc1.similarity(doc2) Attribute extensions WITH DEFAULT VALUE pattern3 = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
# Compare 2 tokens
# Register custom attribute on Token class
doc1[2].similarity(doc2[2])
# Compare tokens and spans
Token.set_extension("is_color", default=False) Operators and quantifiers
# Overwrite extension attribute with default value
doc1[0].similarity(doc2[1:3])
doc[6]._.is_color = True Can be added to a token dict as the "OP" key.

Accessing word vectors ! Negate pattern and match exactly 0 times.


Property extensions WITH GETTER & SETTER
? Make pattern optional and match 0 or 1 times.
# Vector as a numpy array # Register custom attribute on Doc class
doc = nlp("I like cats") get_reversed = lambda doc: doc.text[::-1] + Require pattern to match 1 or more times.
# The L2 norm of the token's vector Doc.set_extension("reversed", getter=get_reversed)
doc[2].vector * Allow pattern to match 0 or more times.
# Compute value of extension attribute with getter
doc[2].vector_norm doc._.reversed
# 'eulb si kroY weN revo yks ehT'
Glossary
Method extensions CALLABLE METHOD Tokenization Segmenting text into words, punctuation etc.
Pipeline components
Lemmatization Assigning the base forms of words, for example:
Functions that take a Doc object, modify it and return it. # Register custom attribute on Span class
"was" → "be" or "rats" → "rat".
has_label = lambda span, label: span.label_ == label
Span.set_extension("has_label", method=has_label) Sentence Boundary Finding and segmenting individual sentences.
nlp
# Compute value of extension attribute with method Detection

Text tokenizer tagger parser ner ... Doc doc[3:5].has_label("GPE") Part-of-speech (POS) Assigning word types to tokens like verb or noun.
# True Tagging

Dependency Parsing Assigning syntactic dependency labels,


describing the relations between individual
Pipeline information tokens, like subject or object.
Rule-based matching
nlp = spacy.load("en_core_web_sm") Named Entity Labeling named "real-world" objects, like
nlp.pipe_names Recognition (NER) persons, companies or locations.
# ['tagger', 'parser', 'ner'] Using the matcher Text Classification Assigning categories or labels to a whole
nlp.pipeline
document, or parts of a document.
# [('tagger', <spacy.pipeline.Tagger>), # Matcher is initialized with the shared vocab
# ('parser', <spacy.pipeline.DependencyParser>), from spacy.matcher import Matcher Statistical model Process for making predictions based on
# ('ner', <spacy.pipeline.EntityRecognizer>)] # Each dict represents one token and its attributes examples.
matcher = Matcher(nlp.vocab) Training Updating a statistical model with new examples.
# Add with ID, optional callback and pattern(s)
Custom components
pattern = [{"LOWER": "new"}, {"LOWER": "york"}]
# Function that modifies the doc and returns it matcher.add("CITIES", None, pattern)
def custom_component(doc): # Match by calling the matcher on a Doc object
print("Do something to the doc here!") doc = nlp("I live in New York")
return doc matches = matcher(doc)
# Add the component first in the pipeline # Matches are (match_id, start, end) tuples
nlp.add_pipe(custom_component, first=True) for match_id, start, end in matches: Learn Python for
# Get the matched span by slicing the Doc data science interactively at
span = doc[start:end]
Components can be added first , last (default), or www.datacamp.com
print(span.text)
before or after an existing component. # 'New York'
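A short sketch of the ordering options just mentioned, reusing the custom_component defined above; the name argument is an assumption used here only to avoid a clash with the earlier add_pipe call:

nlp.add_pipe(custom_component, name="my_component", before="ner")
print(nlp.pipe_names)
# e.g. ['tagger', 'parser', 'my_component', 'ner']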
Python For Data Science Cheat Sheet Excel Spreadsheets Pickled Files
>>> file = 'urbanpop.xlsx' >>> import pickle
Importing Data >>> data = pd.ExcelFile(file) >>> with open('pickled_fruit.pkl', 'rb') as file:
pickled_data = pickle.load(file)
>>> df_sheet2 = data.parse('1960-1966',
Learn Python for data science Interactively at www.DataCamp.com skiprows=[0],
names=['Country',
'AAM: War(2002)'])
>>> df_sheet1 = data.parse(0, HDF5 Files
parse_cols=[0],
Importing Data in Python skiprows=[0], >>> import h5py
>>> filename = 'H-H1_LOSC_4_v1-815411200-4096.hdf5'
names=['Country'])
Most of the time, you’ll use either NumPy or pandas to import >>> data = h5py.File(filename, 'r')
your data: To access the sheet names, use the sheet_names attribute:
>>> import numpy as np >>> data.sheet_names
>>> import pandas as pd Matlab Files
Help SAS Files >>> import scipy.io
>>> filename = 'workspace.mat'
>>> from sas7bdat import SAS7BDAT >>> mat = scipy.io.loadmat(filename)
>>> np.info(np.ndarray.dtype)
>>> help(pd.read_csv) >>> with SAS7BDAT('urbanpop.sas7bdat') as file:
df_sas = file.to_data_frame()

Text Files Exploring Dictionaries


Stata Files Accessing Elements with Functions
Plain Text Files >>> data = pd.read_stata('urbanpop.dta') >>> print(mat.keys()) Print dictionary keys
>>> filename = 'huck_finn.txt' >>> for key in data.keys(): Print dictionary keys
>>> file = open(filename, mode='r') Open the file for reading print(key)
>>> text = file.read() Read a file’s contents Relational Databases meta
quality
>>> print(file.closed) Check whether file is closed
>>> from sqlalchemy import create_engine strain
>>> file.close() Close file
>>> print(text) >>> engine = create_engine('sqlite:///Northwind.sqlite') >>> pickled_data.values() Return dictionary values
>>> print(mat.items()) Returns items in list format of (key, value)
Use the table_names() method to fetch a list of table names: tuple pairs
Using the context manager with
>>> with open('huck_finn.txt', 'r') as file:
>>> table_names = engine.table_names() Accessing Data Items with Keys
print(file.readline()) Read a single line
print(file.readline()) Querying Relational Databases >>> for key in data['meta'].keys(): Explore the HDF5 structure
print(file.readline()) print(key)
>>> con = engine.connect() Description
>>> rs = con.execute("SELECT * FROM Orders") DescriptionURL
Table Data: Flat Files >>> df = pd.DataFrame(rs.fetchall()) Detector
>>> df.columns = rs.keys() Duration
GPSstart
Importing Flat Files with numpy >>> con.close()
Observatory
Files with one data type Using the context manager with Type
UTCstart
>>> filename = 'mnist.txt' >>> with engine.connect() as con:
>>> print(data['meta']['Description'].value) Retrieve the value for a key
>>> data = np.loadtxt(filename, rs = con.execute("SELECT OrderID FROM Orders")
delimiter=',', String used to separate values df = pd.DataFrame(rs.fetchmany(size=5))
df.columns = rs.keys()
skiprows=2,
usecols=[0,2],
Skip the first 2 lines
Read the 1st and 3rd column
Navigating Your FileSystem
dtype=str) The type of the resulting array Querying relational databases with pandas
Magic Commands
Files with mixed data types >>> df = pd.read_sql_query("SELECT * FROM Orders", engine)
>>> filename = 'titanic.csv' !ls List directory contents of files and directories
>>> data = np.genfromtxt(filename, %cd .. Change current working directory
%pwd Return the current working directory path
delimiter=',',
names=True, Look for column header
Exploring Your Data
dtype=None)
NumPy Arrays os Library
>>> data_array = np.recfromcsv(filename) >>> data_array.dtype Data type of array elements >>> import os
>>> data_array.shape Array dimensions >>> path = "/usr/tmp"
The default dtype of the np.recfromcsv() function is None. >>> wd = os.getcwd() Store the name of current directory in a string
>>> len(data_array) Length of array
>>> os.listdir(wd) Output contents of the directory in a list
Importing Flat Files with pandas >>> os.chdir(path) Change current working directory
pandas DataFrames >>> os.rename("test1.txt", Rename a file
>>> filename = 'winequality-red.csv' "test2.txt")
>>> data = pd.read_csv(filename, >>> df.head() Return first DataFrame rows
nrows=5, >>> os.remove("test1.txt") Delete an existing file
Number of rows of file to read >>> df.tail() Return last DataFrame rows >>> os.mkdir("newdir") Create a new directory
header=None, Row number to use as col names >>> df.index Describe index
sep='\t', Delimiter to use >>> df.columns Describe DataFrame columns
comment='#', Character to split comments >>> df.info() Info on DataFrame
na_values=[""]) String to recognize as NA/NaN >>> data_array = data.values Convert a DataFrame to an a NumPy array DataCamp
Learn R for Data Science Interactively
R For Data Science Cheat Sheet dplyr ggplot2
Tidyverse for Beginners Filter Scatter plot
Learn More R for Data Science Interactively at www.datacamp.com filter() allows you to select a subset of rows in a data frame. Scatter plots allow you to compare two variables within your data. To do this with
ggplot2, you use geom_point()
> iris %>% Select iris data of species
filter(Species=="virginica") "virginica" > iris_small <- iris %>%
> iris %>% Select iris data of species filter(Sepal.Length > 5)
Tidyverse filter(Species=="virginica", "virginica" and sepal length > ggplot(iris_small, aes(x=Petal.Length,
y=Petal.Width)) +
Compare petal
width and length
Sepal.Length > 6) greater than 6.
The tidyverse is a powerful collection of R packages that are actually geom_point()
data tools for transforming and visualizing data. All packages of the
tidyverse share an underlying philosophy and common APIs. Arrange Additional Aesthetics
arrange() sorts the observations in a dataset in ascending or descending order • Color
The core packages are: based on one of its variables. > ggplot(iris_small, aes(x=Petal.Length,
y=Petal.Width,
• ggplot2, which implements the grammar of graphics. You can use it > iris %>% Sort in ascending order of color=Species)) +
to visualize your data. arrange(Sepal.Length) sepal length geom_point()
> iris %>% Sort in descending order of
• dplyr is a grammar of data manipulation. You can use it to solve the arrange(desc(Sepal.Length)) sepal length • Size
most common data manipulation challenges. > ggplot(iris_small, aes(x=Petal.Length,
Combine multiple dplyr verbs in a row with the pipe operator %>%: y=Petal.Width,
color=Species,
• tidyr helps you to create tidy data or data where each variable is in a > iris %>% Filter for species "virginica"
size=Sepal.Length)) +
column, each observation is a row and each value is a cell. filter(Species=="virginica") %>% then arrange in descending
geom_point()
arrange(desc(Sepal.Length)) order of sepal length
• readr is a fast and friendly way to read rectangular data. Faceting
Mutate > ggplot(iris_small, aes(x=Petal.Length,
y=Petal.Width)) +
• purrr enhances R’s functional programming (FP) toolkit by providing a mutate() allows you to update or create new columns of a data frame.
geom_point()+
complete and consistent set of tools for working with functions and facet_wrap(~Species)
vectors. > iris %>% Change Sepal.Length to be
mutate(Sepal.Length=Sepal.Length*10) in millimeters
• tibble is a modern re-imagining of the data frame. > iris %>% Create a new column Line Plots
mutate(SLMm=Sepal.Length*10) called SLMm
> by_year <- gapminder %>%
Combine the verbs filter(), arrange(), and mutate(): group_by(year) %>%
• stringr provides a cohesive set of functions designed to make summarize(medianGdpPerCap=median(gdpPercap))
> iris %>%
working with strings as easy as possible > ggplot(by_year, aes(x=year,
filter(Species=="Virginica") %>%
y=medianGdpPerCap))+
mutate(SLMm=Sepal.Length*10) %>% geom_line()+
• forcats provide a suite of useful tools that solve common problems arrange(desc(SLMm)) expand_limits(y=0)
with factors.
Summarize Bar Plots
You can install the complete tidyverse with:
> install.packages("tidyverse") summarize() allows you to turn many observations into a single data point.
> by_species <- iris %>%
> iris %>% Summarize to find the filter(Sepal.Length>6) %>%
Then, load the core tidyverse and make it available in your current R summarize(medianSL=median(Sepal.Length)) median sepal length group_by(Species) %>%
session by running: > iris %>% Filter for virginica then summarize(medianPL=median(Petal.Length))
> library(tidyverse) filter(Species=="virginica") %>% summarize the median > ggplot(by_species, aes(x=Species,
summarize(medianSL=median(Sepal.Length)) sepal length y=medianPL)) +
Note: there are many other tidyverse packages with more specialised usage. They are not geom_col()
loaded automatically with library(tidyverse), so you’ll need to load each one with its own call You can also summarize multiple variables at once:
to library().
> iris %>% Histograms
Useful Functions filter(Species=="virginica") %>%
summarize(medianSL=median(Sepal.Length), > ggplot(iris_small, aes(x=Petal.Length))+
> tidyverse_conflicts() Conflicts between tidyverse and other maxSL=max(Sepal.Length)) geom_histogram()
packages
> tidyverse_deps() List all tidyverse dependencies group_by() allows you to summarize within groups instead of summarizing the
> tidyverse_logo() Get tidyverse logo, using ASCII or unicode entire dataset:
characters
> tidyverse_packages() List all tidyverse packages
> iris %>% Find median and max Box Plots
group_by(Species) %>% sepal length of each
> tidyverse_update() Update tidyverse packages summarize(medianSL=median(Sepal.Length), species > ggplot(iris_small, aes(x=Species,
maxSL=max(Sepal.Length)) y=Sepal.Width))+
Loading in the data > iris %>%
filter(Sepal.Length>6) %>%
Find median and max
petal length of each
geom_boxplot()

> library(datasets) Load the datasets package group_by(Species) %>% species with sepal
> library(gapminder) Load the gapminder package summarize(medianPL=median(Petal.Length), length > 6
> attach(iris) Attach iris data to the R search path maxPL=max(Petal.Length)) DataCamp
Learn R for Data Science Interactively
Python For Data Science Cheat Sheet 3 Renderers & Visual Customizations
Bokeh Glyphs Grid Layout
Learn Bokeh Interactively at www.DataCamp.com, Scatter Markers >>> from bokeh.layouts import gridplot
taught by Bryan Van de Ven, core contributor >>> p1.circle(np.array([1,2,3]), np.array([3,2,1]), >>> row1 = [p1,p2]
fill_color='white') >>> row2 = [p3]
>>> p2.square(np.array([1.5,3.5,5.5]), [1,4,3], >>> layout = gridplot([[p1,p2],[p3]])
color='blue', size=1)
Plotting With Bokeh Line Glyphs Tabbed Layout
>>> p1.line([1,2,3,4], [3,4,5,6], line_width=2)
>>> p2.multi_line(pd.DataFrame([[1,2,3],[5,6,7]]), >>> from bokeh.models.widgets import Panel, Tabs
The Python interactive visualization library Bokeh >>> tab1 = Panel(child=p1, title="tab1")
pd.DataFrame([[3,4,5],[3,2,1]]),
enables high-performance visual presentation of color="blue") >>> tab2 = Panel(child=p2, title="tab2")
>>> layout = Tabs(tabs=[tab1, tab2])
large datasets in modern web browsers.
Customized Glyphs Also see Data
Linked Plots
Bokeh’s mid-level general purpose bokeh.plotting Selection and Non-Selection Glyphs
>>> p = figure(tools='box_select') Linked Axes
interface is centered around two main components: data >>> p.circle('mpg', 'cyl', source=cds_df, >>> p2.x_range = p1.x_range
and glyphs. selection_color='red', >>> p2.y_range = p1.y_range
nonselection_alpha=0.1) Linked Brushing
>>> p4 = figure(plot_width = 100,
+ = Hover Glyphs tools='box_select,lasso_select')
>>> from bokeh.models import HoverTool
>>> p4.circle('mpg', 'cyl', source=cds_df)
data glyphs plot >>> hover = HoverTool(tooltips=None, mode='vline')
>>> p5 = figure(plot_width = 200,
>>> p3.add_tools(hover)
tools='box_select,lasso_select')
The basic steps to creating plots with the bokeh.plotting >>> p5.circle('mpg', 'hp', source=cds_df)
interface are: US
Colormapping >>> layout = row(p4,p5)
1. Prepare some data: >>> from bokeh.models import CategoricalColorMapper
Asia
Europe

Python lists, NumPy arrays, Pandas DataFrames and other sequences of values
2. Create a new plot
>>> color_mapper = CategoricalColorMapper(
factors=['US', 'Asia', 'Europe'],
palette=['blue', 'red', 'green'])
4 Output & Export
3. Add renderers for your data, with visual customizations >>> p3.circle('mpg', 'cyl', source=cds_df, Notebook
color=dict(field='origin',
4. Specify where to generate the output transform=color_mapper), >>> from bokeh.io import output_notebook, show
5. Show or save the results legend='Origin') >>> output_notebook()
>>> from bokeh.plotting import figure
>>> from bokeh.io import output_file, show Legend Location HTML
>>> x = [1, 2, 3, 4, 5] Step 1
>>> y = [6, 7, 2, 4, 5] Inside Plot Area Standalone HTML
>>> p = figure(title="simple line example", Step 2 >>> p.legend.location = 'bottom_left' >>> from bokeh.embed import file_html
>>> from bokeh.resources import CDN
x_axis_label='x',
>>> html = file_html(p, CDN, "my_plot")
y_axis_label='y') Outside Plot Area
>>> p.line(x, y, legend="Temp.", line_width=2) Step 3 >>> from bokeh.models import Legend
>>> r1 = p2.asterisk(np.array([1,2,3]), np.array([3,2,1])) >>> from bokeh.io import output_file, show
>>> output_file("lines.html") Step 4 >>> r2 = p2.line([1,2,3,4], [3,4,5,6]) >>> output_file('my_bar_chart.html', mode='cdn')
>>> show(p) Step 5 >>> legend = Legend(items=[("One" ,[p1, r1]),("Two",[r2])],
location=(0, -30)) Components
1 Data Also see Lists, NumPy & Pandas
>>> p.add_layout(legend, 'right')

Legend Orientation
>>> from bokeh.embed import components
>>> script, div = components(p)
Under the hood, your data is converted to Column Data
Sources. You can also do this manually: >>> p.legend.orientation = "horizontal" PNG
>>> import numpy as np >>> p.legend.orientation = "vertical"
>>> from bokeh.io import export_png
>>> import pandas as pd >>> export_png(p, filename="plot.png")
>>> df = pd.DataFrame(np.array([[33.9,4,65, 'US'], Legend Background & Border
[32.4,4,66, 'Asia'],
[21.4,4,109, 'Europe']]), >>> p.legend.border_line_color = "navy" SVG
columns=['mpg','cyl', 'hp', 'origin'], >>> p.legend.background_fill_color = "white"
index=['Toyota', 'Fiat', 'Volvo']) >>> from bokeh.io import export_svgs
>>> from bokeh.models import ColumnDataSource Rows & Columns Layout >>> p.output_backend = "svg"
>>> export_svgs(p, filename="plot.svg")
>>> cds_df = ColumnDataSource(df) Rows
>>> from bokeh.layouts import row

2 Plotting >>> layout = row(p1,p2,p3)


Columns
5 Show or Save Your Plots
>>> from bokeh.plotting import figure >>> from bokeh.layouts import column >>> show(p1) >>> show(layout)
>>> p1 = figure(plot_width=300, tools='pan,box_zoom') >>> layout = column(p1,p2,p3) >>> save(p1) >>> save(layout)
>>> p2 = figure(plot_width=300, plot_height=300, Nesting Rows & Columns
x_range=(0, 8), y_range=(0, 8)) >>>layout = row(column(p1,p2), p3) DataCamp
>>> p3 = figure() Learn Python for Data Science Interactively
Working with Different Programming Languages Widgets
Python For Data Science Cheat Sheet Kernels provide computation and communication with front-end interfaces Notebook widgets provide the ability to visualize and control changes
Jupyter Notebook like the notebooks. There are three main kernels: in your data, often as a control like a slider, textbox, etc.
Learn More Python for Data Science Interactively at www.DataCamp.com
You can use them to build interactive GUIs for your notebooks or to
IRkernel IJulia
synchronize stateful and stateless information between Python and
Installing Jupyter Notebook will automatically install the IPython kernel. JavaScript.
Saving/Loading Notebooks Restart kernel Interrupt kernel
Create new notebook Restart kernel & run Interrupt kernel & Download serialized Save notebook
all cells clear all output state of all widget with interactive
Open an existing
Connect back to a models in use widgets
Make a copy of the notebook Restart kernel & run remote notebook
current notebook all cells Embed current
Rename notebook Run other installed
widgets
kernels
Revert notebook to a
Save current notebook
previous checkpoint Command Mode:
and record checkpoint
Download notebook as
Preview of the printed - IPython notebook 15
notebook - Python
- HTML
Close notebook & stop - Markdown 13 14
- reST
running any scripts - LaTeX 1 2 3 4 5 6 7 8 9 10 11 12
- PDF

Writing Code And Text


Code and text are encapsulated by 3 basic cell types: markdown cells, code
cells, and raw NBConvert cells.
Edit Cells Edit Mode: 1. Save and checkpoint 9. Interrupt kernel
2. Insert cell below 10. Restart kernel
3. Cut cell 11. Display characteristics
Cut currently selected cells Copy cells from 4. Copy cell(s) 12. Open command palette
to clipboard clipboard to current 5. Paste cell(s) below 13. Current kernel
cursor position 6. Move cell up 14. Kernel status
Paste cells from Executing Cells 7. Move cell down 15. Log out from notebook server
clipboard above Paste cells from 8. Run current cell
current cell Run selected cell(s) Run current cells down
clipboard below
and create a new one
Paste cells from current cell
below Asking For Help
clipboard on top Run current cells down
Delete current cells
of current cell and create a new one Walk through a UI tour
Split up a cell from above Run all cells
Revert “Delete Cells” List of built-in keyboard
current cursor Run all cells above the Run all cells below
invocation shortcuts
position current cell the current cell Edit the built-in
Merge current cell Merge current cell keyboard shortcuts
Change the cell type of toggle, toggle Notebook help topics
with the one above with the one below current cell scrolling and clear Description of
Move current cell up Move current cell toggle, toggle current outputs markdown available Information on
down scrolling and clear in notebook unofficial Jupyter
Adjust metadata
underlying the Find and replace all output Notebook extensions
Python help topics
current notebook in selected cells IPython help topics
View Cells
Remove cell Copy attachments of NumPy help topics
attachments current cell Toggle display of Jupyter SciPy help topics
Toggle display of toolbar Matplotlib help topics
Paste attachments of Insert image in logo and filename
SymPy help topics
current cell selected cells Toggle display of cell Pandas help topics
action icons:
Insert Cells - None About Jupyter Notebook
- Edit metadata
Toggle line numbers - Raw cell format
Add new cell above the Add new cell below the - Slideshow
current one in cells - Attachments
current one DataCamp
- Tags
Learn Python for Data Science Interactively
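The Widgets panel described above corresponds to the ipywidgets package; a minimal sketch, assuming ipywidgets is installed and the code runs inside a notebook cell:

from ipywidgets import interact

def f(x):
    return x * 2

# Renders an integer slider from 0 to 10; moving it re-runs f and
# displays the result below the cell
interact(f, x=(0, 10))

Widget state lives in the kernel and is mirrored to the browser, which is what the "Download serialized state of all widget models in use" and "Embed current widgets" commands above operate on.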
Python For Data Science Cheat Sheet 3 Plotting With Seaborn
Seaborn Axis Grids
Learn Data Science Interactively at www.DataCamp.com >>> g = sns.FacetGrid(titanic, Subplot grid for plotting conditional >>> h = sns.PairGrid(iris) Subplot grid for plotting pairwise
col="survived", relationships >>> h = h.map(plt.scatter) relationships
row="sex") >>> sns.pairplot(iris) Plot pairwise bivariate distributions
>>> g = g.map(plt.hist,"age") >>> i = sns.JointGrid(x="x", Grid for bivariate plot with marginal
>>> sns.factorplot(x="pclass", Draw a categorical plot onto a y="y", univariate plots
y="survived", Facetgrid data=data)
Statistical Data Visualization With Seaborn hue="sex",
data=titanic)
>>> i = i.plot(sns.regplot,
sns.distplot)
The Python visualization library Seaborn is based on >>> sns.lmplot(x="sepal_width", Plot data and regression model fits >>> sns.jointplot("sepal_length", Plot bivariate distribution
y="sepal_length", across a FacetGrid "sepal_width",
matplotlib and provides a high-level interface for drawing hue="species", data=iris,
attractive statistical graphics. data=iris) kind='kde')

Categorical Plots Regression Plots


Make use of the following aliases to import the libraries: >>> sns.regplot(x="sepal_width", Plot data and a linear regression
Scatterplot
>>> import matplotlib.pyplot as plt y="sepal_length", model fit
>>> sns.stripplot(x="species", Scatterplot with one
>>> import seaborn as sns data=iris,
y="petal_length", categorical variable
data=iris) ax=ax)
The basic steps to creating plots with Seaborn are: >>> sns.swarmplot(x="species", Categorical scatterplot with Distribution Plots
y="petal_length", non-overlapping points
1. Prepare some data data=iris) >>> plot = sns.distplot(data.y, Plot univariate distribution
2. Control figure aesthetics Bar Chart kde=False,
color="b")
3. Plot with Seaborn >>> sns.barplot(x="sex", Show point estimates and
y="survived", confidence intervals with Matrix Plots
4. Further customize your plot hue="class", scatterplot glyphs
>>> sns.heatmap(uniform_data,vmin=0,vmax=1) Heatmap
data=titanic)
>>> import matplotlib.pyplot as plt Count Plot
>>>
>>>
>>>
import seaborn as sns
tips = sns.load_dataset("tips")
sns.set_style("whitegrid") Step 2
Step 1
>>> sns.countplot(x="deck",
data=titanic,
Show count of observations
4 Further Customizations Also see Matplotlib
palette="Greens_d")
>>> g = sns.lmplot(x="tip", Step 3
Point Plot Axisgrid Objects
y="total_bill",
data=tips, >>> sns.pointplot(x="class", Show point estimates and >>> g.despine(left=True) Remove left spine
aspect=2) y="survived", confidence intervals as >>> g.set_ylabels("Survived") Set the labels of the y-axis
>>> g = (g.set_axis_labels("Tip","Total bill(USD)"). hue="sex", rectangular bars >>> g.set_xticklabels(rotation=45) Set the tick labels for x
set(xlim=(0,10),ylim=(0,100))) data=titanic, >>> g.set_axis_labels("Survived", Set the axis labels
Step 4 palette={"male":"g", "Sex")
>>> plt.title("title")
>>> plt.show(g) Step 5 "female":"m"}, >>> h.set(xlim=(0,5), Set the limit and ticks of the
markers=["^","o"], ylim=(0,5), x-and y-axis
linestyles=["-","--"]) xticks=[0,2.5,5],

1
Boxplot yticks=[0,2.5,5])
Data Also see Lists, NumPy & Pandas >>> sns.boxplot(x="alive", Boxplot
Plot
y="age",
>>> import pandas as pd hue="adult_male",
>>> import numpy as np >>> plt.title("A Title") Add plot title
data=titanic)
>>> uniform_data = np.random.rand(10, 12) >>> plt.ylabel("Survived") Adjust the label of the y-axis
>>> sns.boxplot(data=iris,orient="h") Boxplot with wide-form data
>>> data = pd.DataFrame({'x':np.arange(1,101), >>> plt.xlabel("Sex") Adjust the label of the x-axis
'y':np.random.normal(0,4,100)}) Violinplot >>> plt.ylim(0,100) Adjust the limits of the y-axis
>>> sns.violinplot(x="age", Violin plot >>> plt.xlim(0,10) Adjust the limits of the x-axis
Seaborn also offers built-in data sets: y="sex", >>> plt.setp(ax,yticks=[0,5]) Adjust a plot property
>>> titanic = sns.load_dataset("titanic") hue="survived", >>> plt.tight_layout() Adjust subplot params
>>> iris = sns.load_dataset("iris") data=titanic)

2 Figure Aesthetics Also see Matplotlib


5 Show or Save Plot Also see Matplotlib
>>> plt.show() Show the plot
Context Functions >>> plt.savefig("foo.png") Save the plot as a figure
>>> f, ax = plt.subplots(figsize=(5,6)) Create a figure and one subplot >>> plt.savefig("foo.png", Save transparent figure
>>> sns.set_context("talk") Set context to "talk" transparent=True)
>>> sns.set_context("notebook", Set context to "notebook",
Seaborn styles font_scale=1.5, Scale font elements and
>>> sns.set() (Re)set the seaborn default
rc={"lines.linewidth":2.5}) override param mapping Close & Clear Also see Matplotlib
>>> sns.set_style("whitegrid") Set the matplotlib parameters Color Palette >>> plt.cla() Clear an axis
>>> sns.set_style("ticks", Set the matplotlib parameters >>> plt.clf() Clear an entire figure
{"xtick.major.size":8, >>> sns.set_palette("husl",3) Define the color palette >>> plt.close() Close a window
"ytick.major.size":8}) >>> sns.color_palette("husl") Use with with to temporarily set palette
>>> sns.axes_style("whitegrid") Return a dict of params or use with >>> flatui = ["#9b59b6","#3498db","#95a5a6","#e74c3c","#34495e","#2ecc71"]
with to temporarily set the style >>> sns.set_palette(flatui) Set your own color palette DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Duplicate Values GroupBy
>>> df = df.dropDuplicates() >>> df.groupBy("age")\ Group by age, count the members
PySpark - SQL Basics .count() \
.show()
in the groups
Learn Python for data science Interactively at www.DataCamp.com Queries
>>> from pyspark.sql import functions as F
Select Filter
>>> df.select("firstName").show() Show all entries in firstName column >>> df.filter(df["age"]>24).show() Filter entries of age, only keep those
>>> df.select("firstName","lastName") \ records of which the values are >24
PySpark & Spark SQL .show()
>>> df.select("firstName", Show all entries in firstName, age

Spark SQL is Apache Spark's module for "age", and type


Sort
explode("phoneNumber") \
working with structured data. .alias("contactInfo")) \
.select("contactInfo.type", >>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
Initializing SparkSession "firstName",
"age") \ >>> df.orderBy(["age","city"],ascending=[0,1])\
A SparkSession can be used to create DataFrame, register DataFrame as tables, .show() .collect()
execute SQL over tables, cache tables, and read parquet files. >>> df.select(df["firstName"],df["age"]+ 1) Show all entries in firstName and age,
.show() add 1 to the entries of age
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
>>> df.select(df['age'] > 24).show()
When
Show all entries where age >24 Missing & Replacing Values
.builder \ >>> df.select("firstName", Show firstName and 0 or 1 depending
.appName("Python Spark SQL basic example") \ >>> df.na.fill(50).show() Replace null values
F.when(df.age > 30, 1) \ on age >30 >>> df.na.drop().show() Return new df omitting rows with null values
.config("spark.some.config.option", "some-value") \ .otherwise(0)) \
.getOrCreate() >>> df.na \ Return new df replacing one value with
.show() .replace(10, 20) \ another
>>> df[df.firstName.isin("Jane","Boris")] Show firstName if in the given options .show()
Creating DataFrames Like
.collect()

From RDDs
>>> df.select("firstName", Show firstName, and lastName is
df.lastName.like("Smith")) \ TRUE if lastName is like Smith
Repartitioning
.show()
>>> from pyspark.sql.types import * Startswith - Endswith >>> df.repartition(10)\ df with 10 partitions
>>> df.select("firstName", Show firstName, and TRUE if .rdd \
Infer Schema .getNumPartitions()
>>> sc = spark.sparkContext df.lastName \ lastName starts with Sm
.startswith("Sm")) \ >>> df.coalesce(1).rdd.getNumPartitions() df with 1 partition
>>> lines = sc.textFile("people.txt")
.show()
>>> parts = lines.map(lambda l: l.split(",")) >>> df.select(df.lastName.endswith("th")) \ Show last names ending in th
>>>
>>>
people = parts.map(lambda p: Row(name=p[0],age=int(p[1])))
peopledf = spark.createDataFrame(people)
.show() Running SQL Queries Programmatically
Substring
Specify Schema >>> df.select(df.firstName.substr(1, 3) \ Return substrings of firstName Registering DataFrames as Views
>>> people = parts.map(lambda p: Row(name=p[0], .alias("name")) \
age=int(p[1].strip()))) .collect() >>> peopledf.createGlobalTempView("people")
>>> schemaString = "name age" Between >>> df.createTempView("customer")
>>> fields = [StructField(field_name, StringType(), True) for >>> df.select(df.age.between(22, 24)) \ Show age: values are TRUE if between >>> df.createOrReplaceTempView("customer")
field_name in schemaString.split()] .show() 22 and 24
>>> schema = StructType(fields) Query Views
>>> spark.createDataFrame(people, schema).show()
+--------+---+
| name|age|
Add, Update & Remove Columns >>> df5 = spark.sql("SELECT * FROM customer").show()
+--------+---+ >>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")\
|
|
Mine| 28|
Filip| 29|
Adding Columns .show()
|Jonathan| 30|
+--------+---+ >>> df = df.withColumn('city',df.address.city) \
.withColumn('postalCode',df.address.postalCode) \
From Spark Data Sources .withColumn('state',df.address.state) \
.withColumn('streetAddress',df.address.streetAddress) \
Output
.withColumn('telePhoneNumber', Data Structures
JSON explode(df.phoneNumber.number)) \
>>> df = spark.read.json("customer.json") .withColumn('telePhoneType',
>>> df.show() >>> rdd1 = df.rdd Convert df into an RDD
+--------------------+---+---------+--------+--------------------+ explode(df.phoneNumber.type)) >>> df.toJSON().first() Convert df into a RDD of string
| address|age|firstName |lastName| phoneNumber|
+--------------------+---+---------+--------+--------------------+ >>> df.toPandas() Return the contents of df as Pandas
|[New York,10021,N...| 25|
|[New York,10021,N...| 21|
John|
Jane|
Smith|[[212 555-1234,ho...|
Doe|[[322 888-1234,ho...|
Updating Columns DataFrame
+--------------------+---+---------+--------+--------------------+
>>> df2 = spark.read.load("people.json", format="json")
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber') Write & Save to Files
Parquet files Removing Columns >>> df.select("firstName", "city")\
>>> df3 = spark.read.load("users.parquet") .write \
TXT files >>> df = df.drop("address", "phoneNumber") .save("nameAndCity.parquet")
>>> df4 = spark.read.text("people.txt") >>> df = df.drop(df.address).drop(df.phoneNumber) >>> df.select("firstName", "age") \
.write \
.save("namesAndAges.json",format="json")
Inspect Data
>>> df.dtypes Return df column names and data types >>> df.describe().show() Compute summary statistics Stopping SparkSession
>>> df.show() Display the content of df >>> df.columns Return the columns of df
>>> df.count() >>> spark.stop()
>>> df.head() Return first n rows Count the number of rows in df
>>> df.first() Return first row >>> df.distinct().count() Count the number of distinct rows in df
>>> df.take(2) Return the first n rows >>> df.printSchema() Print the schema of df DataCamp
>>> df.schema Return the schema of df >>> df.explain() Print the (logical and physical) plans
Learn Python for Data Science Interactively
R For Data Science Cheat Sheet General form: DT[i, j, by] Advanced Data Table Operations
> DT[.N-1]
data.table “Take DT, subset rows using i, then calculate j grouped by by” > DT[,.N]
Return the penultimate row of the DT
Return the number of rows
> DT[,.(V2,V3)] Return V2 and V3 as a data.table
Learn R for data science Interactively at www.DataCamp.com
Adding/Updating Columns By Reference in j Using := >
>
DT[,list(V2,V3)]
DT[,mean(V3),by=.(V1,V2)]
Return V2 and V3 as a data.table
Return the result of j, grouped by all possible
> DT[,V1:=round(exp(V1),2)] V1 is updated by what is after := V1 V2 V1 combinations of groups specified in by
> DT Return the result by calling DT 1: 1 A 0.4053
2: 1 B 0.4053
V1 V2 V3 V4
data.table 1: 2.72 A -0.1107 1
2: 7.39 B -0.1427 2
3:
4:
1 C 0.4053
2 A -0.6443
5: 2 B -0.6443
data.table is an R package that provides a high-performance 3: 2.72 C -1.8893 3 6: 2 C -0.6443
4: 7.39 A -0.3571 4
version of base R’s data.frame with syntax and feature ... .SD & .SDcols
enhancements for ease of use, convenience and > DT[,c("V1","V2"):=list(round(exp(V1),2), Columns V1 and V2 are updated by
> DT[,print(.SD),by=V2] Look at what .SD contains
LETTERS[4:6])] what is after :=
programming speed. > DT[,':='(V1=round(exp(V1),2), Alternative to the above one. With [], > DT[,.SD[c(1,.N)],by=V2] Select the first and last row grouped by V2
V2=LETTERS[4:6])][] you print the result to the screen > DT[,lapply(.SD,sum),by=V2] Calculate sum of columns in .SD grouped by
Load the package: V1 V2 V3 V4
V2
1: 15.18 D -0.1107 1 > DT[,lapply(.SD,sum),by=V2, Calculate sum of V3 and V4 in .SD grouped by
> library(data.table) .SDcols=c("V3","V4")] V2
2: 1619.71 E -0.1427 2
V2 V3 V4

Creating A data.table
3: 15.18 F -1.8893 3
1: A -0.478 22
4: 1619.71 D -0.3571 4 2: B -0.478 26

> set.seed(45L) Create a data.table > DT[,V1:=NULL] Remove V1 3: C -0.478 30


> DT[,c("V1","V2"):=NULL] Remove columns V1 and V2 > DT[,lapply(.SD,sum),by=V2, Calculate sum of V3 and V4 in .SD grouped by
> DT <- data.table(V1=c(1L,2L), and call it DT .SDcols=paste0("V",3:4)] V2
V2=LETTERS[1:3], > Cols.chosen=c("A","B")
V3=round(rnorm(4),4), > DT[,Cols.Chosen:=NULL] Delete the column with column name
V4=1:12) Cols.chosen
> DT[,(Cols.Chosen):=NULL] Delete the columns specified in the Chaining
variable Cols.chosen
Subsetting Rows Using i > DT <- DT[,.(V4.Sum=sum(V4)),
by=V1]
Calculate sum of V4, grouped by V1

> DT[3:5,] Select 3rd to 5th row Indexing And Keys V1 V4.Sum
> DT[3:5] Select 3rd to 5th row 1: 1 36
> DT[V2=="A"] Select all rows that have value A in column V2 > setkey(DT,V2) A key is set on V2; output is returned invisibly 2: 2 42
> DT[V2 %in% c("A","C")] Select all rows that have value A or C in column V2 > DT["A"] Return all rows where the key column (set to V2) has > DT[V4.Sum>40] Select that group of which the sum is >40
V1 V2 V3 V4 the value A > DT[,.(V4.Sum=sum(V4)), Select that group of which the sum is >40
Manipulating on Columns in j by=V1][V4.Sum>40] (chaining)
1: 1 A -0.2392 1
2: 2 A -1.6148 4 V1 V4.Sum
3: 1 A 1.0498 7 1: 2 42
> DT[,V2] Return V2 as a vector 4: 2 A 0.3262 10 > DT[,.(V4.Sum=sum(V4)), Calculate sum of V4, grouped by V1,
[1] “A” “B” “C” “A” “B” “C” ... > DT[c("A","C")] Return all rows where the key column (V2) has value A or C by=V1][order(-V1)] ordered on V1
> DT[,.(V2,V3)] Return V2 and V3 as a data.table > DT["A",mult="first"] Return first row of all rows that match value A in key V1 V4.Sum
> DT[,sum(V1)] Return the sum of all elements of V1 in a column V2 1: 2 42
[1] 18 vector > DT["A",mult="last"] Return last row of all rows that match value A in key 2: 1 36
> DT[,.(sum(V1),sd(V3))] Return the sum of all elements of V1 and the column V2
V1 V2 std. dev. of V3 in a data.table > DT[c("A","D")] Return all rows where key column V2 has value A or D
1: 18 0.4546055
> DT[,.(Aggregate=sum(V1), The same as the above, with new names
V1 V2 V3 V4
1: 1 A -0.2392 1 set()-Family
Sd.V3=sd(V3))] 2: 2 A -1.6148 4
Aggregate Sd.V3 3: 1 A 1.0498 7 set()
1: 18 0.4546055 4: 2 A 0.3262 10
> DT[,.(V1,Sd.V3=sd(V3))] Select column V2 and compute std. dev. of V3, 5: NA D NA NA Syntax: for (i in from:to) set(DT, row, column, new value)
which returns a single value and gets recycled > DT[c("A","D"),nomatch=0] Return all rows where key column V2 has value A or D > rows <- list(3:4,5:6)
V1 V2 V3 V4
> DT[,.(print(V2), Print column V2 and plot V3 > cols <- 1:2
1: 1 A -0.2392 1
plot(V3), > for(i in seq_along(rows)) Sequence along the values of rows, and
2: 2 A -1.6148 4
NULL)] {set(DT, for the values of cols, set the values of
3: 1 A 1.0498 7
4: 2 A 0.3262 10 i=rows[[i]], those elements equal to NA (invisible)
j=cols[i],
Doing j by Group > DT[c("A","C"),sum(V4)] Return total sum of V4, for rows of key column V2 that
have values A or C value=NA)}
> DT[,.(V4.Sum=sum(V4)),by=V1] Calculate sum of V4 for every group in V1
V1 V4.Sum
> DT[c("A","C"),
sum(V4),
Return sum of column V4 for rows of V2 that have value A,
and anohter sum for rows of V2 that have value C setnames()
1: 1 36 by=.EACHI] Syntax: setnames(DT,"old","new")[]
2: 2 42 V2 V1
1: A 22 > setnames(DT,"V2","Rating") Set name of V2 to Rating (invisible)
> DT[,.(V4.Sum=sum(V4)), Calculate sum of V4 for every group in V1 Change 2 column names (invisible)
by=.(V1,V2)] and V2 2: C 30 > setnames(DT,
> DT[,.(V4.Sum=sum(V4)), Calculate sum of V4 for every group in > setkey(DT,V1,V2) Sort by V1 and then by V2 within each group of V1 (invisible) c("V2","V3"),
by=sign(V1-1)] sign(V1-1) > DT[.(2,"C")] Select rows that have value 2 for the first key (V1) and the c("V2.rating","V3.DC"))
value C for the second key (V2)
setnames()
V1 V2 V3 V4
sign V4.Sum
1: 0 36 1: 2 C 0.3262 6
2: 1 42 2: 2 C -1.6148 12
Syntax: setcolorder(DT,"neworder")
> DT[,.(V4.Sum=sum(V4)), The same as the above, with new name > DT[.(2,c("A","C"))] Select rows that have value 2 for the first key (V1) and within
V1 V2 V3 V4 those rows the value A or C for the second key (V2) > setcolorder(DT, Change column ordering to contents
by=.(V1.01=sign(V1-1))] for the variable you’re grouping by
> DT[1:5,.(V4.Sum=sum(V4)), Calculate sum of V4 for every group in V1 1: 2 A -1.6148 4 c("V2","V1","V4","V3")) of the specified vector (invisible)
2: 2 A 0.3262 10
by=V1] after subsetting on the first 5 rows
3: 2 C 0.3262 6
> DT[,.N,by=V1] Count number of rows for every group in
4: 2 C -1.6148 12
DataCamp
V1 Learn Python for Data Science Interactively
R For Data Science Cheat Sheet Export xts Objects
> data_xts <- as.xts(matrix)
Missing Values
> na.omit(xts5) Omit NA values in xts5
xts > tmp <- tempfile()
> write.zoo(data_xts,sep=",",file=tmp)
> xts_last <- na.locf(xts2) Fill missing values in xts2 using
Learn R for data science Interactively at www.DataCamp.com last observation
> xts_last <- na.locf(xts2, Fill missing values in xts2 using
fromLast=TRUE) next observation
Replace & Update > na.approx(xts2) Interpolate NAs using linear
> xts2[dates] <- 0 Replace values in xts2 on dates with 0 approximation
xts > xts5["1961"] <- NA Replace dates from 1961 with NA

eXtensible Time Series (xts) is a powerful package that


> xts2["2016-05-02"] <- NA Replace the value at 1 specific index with NA
Arithmetic Operations
provides an extensible time series class, enabling uniform coredata()or as.numeric()
handling of many R time series classes by extending zoo.
Applying Functions
> ep1 <- endpoints(xts4,on="weeks",k=2) Take index values by time > xts3 + as.numeric(xts2) Addition
[1] 0 5 10 > xts3 * as.numeric(xts4)
Load the package as follows: > ep2 <- endpoints(xts5,on="years")
Multiplication
> coredata(xts4) - xts3 Subtraction
> library(xts) [1] 0 12 24 36 48 60 72 84 96 108 120 132 144 > coredata(xts4) / xts3 Division
> period.apply(xts5,INDEX=ep2,FUN=mean) Calculate the yearly mean
> xts5_yearly <- split(xts5,f="years")      Split xts5 by year
> lapply(xts5_yearly,FUN=mean)              Create a list of yearly means
> do.call(rbind,                            Find the last observation in each year in xts5
    lapply(split(xts5,"years"),
           function(w) last(w,n="1 month")))
> do.call(rbind,                            Calculate cumulative annual passengers
    lapply(split(xts5,"years"),
           cumsum))
> rollapply(xts5, 3, sd)                    Apply sd to rolling margins of xts5

xts Objects
xts objects have three main components:
- coredata: always a matrix for xts objects, while it could also be a vector for zoo objects
- index: vector of any Date, POSIXct, chron, yearmon, yearqtr, or DateTime classes
- xtsAttributes: arbitrary attributes

Creating xts Objects
> xts1 <- xts(x=1:10, order.by=Sys.Date()-1:10)
> data <- rnorm(5)
> dates <- seq(as.Date("2017-05-01"),length=5,by="days")
> xts2 <- xts(x=data, order.by=dates)
> xts3 <- xts(x=rnorm(10),
    order.by=as.POSIXct(Sys.Date()+1:10),
    born=as.POSIXct("1899-05-08"))
> xts4 <- xts(x=1:10, order.by=Sys.Date()+1:10)

Convert To And From xts
> data(AirPassengers)
> xts5 <- as.xts(AirPassengers)

Import From Files
> dat <- read.csv(tmp_file)
> xts(dat, order.by=as.Date(rownames(dat),"%m/%d/%Y"))
> dat_zoo <- read.zoo(tmp_file,
    index.column=0,
    sep=",",
    format="%m/%d/%Y")
> dat_zoo <- read.zoo(tmp,sep=",",FUN=as.yearmon)
> dat_xts <- as.xts(dat_zoo)

Selecting, Subsetting & Indexing
Select
> mar55 <- xts5["1955-03"]                  Get value for March 1955

Subset
> xts5_1954 <- xts5["1954"]                 Get all data from 1954
> xts5_janmarch <- xts5["1954/1954-03"]     Extract data from Jan to March '54
> xts5_janmarch <- xts5["/1954-03"]         Get all data until March '54
> xts4[ep1]                                 Subset xts4 using ep1

first() and last()
> first(xts4,'1 week')                      Extract first 1 week
> first(last(xts4,'1 week'),'3 days')       Get first 3 days of the last week of data

Indexing
> xts2[index(xts3)]                         Extract rows with the index of xts3
> days <- c("2017-05-03","2017-05-23")
> xts3[days]                                Extract rows using the vector days
> xts2[as.POSIXct(days,tz="UTC")]           Extract rows using days as POSIXct
> index <- which(.indexwday(xts1)==0|.indexwday(xts1)==6)   Index of weekend days
> xts1[index]                               Extract weekend days of xts1

Shifting Index Values
> xts5 - lag(xts5)                          Period-over-period differences
> diff(xts5,lag=12,differences=1)           Lagged differences

Reindexing
> xts1 + merge(xts2,index(xts1),fill=0)     Addition
                     e1
  2017-05-04  5.231538
  2017-05-05  5.829257
  2017-05-06  4.000000
  2017-05-07  3.000000
  2017-05-08  2.000000
  2017-05-09  1.000000
> xts1 - merge(xts2,index(xts1),fill=na.locf)   Subtraction
                     e1
  2017-05-04  5.231538
  2017-05-05  5.829257
  2017-05-06  4.829257
  2017-05-07  3.829257
  2017-05-08  2.829257
  2017-05-09  1.829257

Merging
> merge(xts2,xts1,join='inner')             Inner join of xts2 and xts1
                    xts2 xts1
  2017-05-05  -0.8382068   10
> merge(xts2,xts1,join='left',fill=0)       Left join of xts2 and xts1, fill empty spots with 0
                    xts2 xts1
  2017-05-01   1.7482704    0
  2017-05-02  -0.2314678    0
  2017-05-03   0.1685517    0
  2017-05-04   1.1685649    0
  2017-05-05  -0.8382068   10
> rbind(xts1, xts4)                         Combine xts1 and xts4 by rows
Inspect Your Data
> core_data <- coredata(xts2) Extract core data of objects Periods, Periodicity & Timestamps Other Useful Functions
> index(xts1) Extract index of objects
> periodicity(xts5) Estimate frequency of observations > .index(xts4) Extract raw numeric index of xts4
Class Attributes > to.yearly(xts5) Convert xts5 to yearly OHLC > .indexwday(xts3) Value of week(day), starting on Sunday,
> to.monthly(xts3) Convert xts3 to monthly OHLC in index of xts3
> to.quarterly(xts5) Convert xts5 to quarterly OHLC > .indexhour(xts3) Value of hour in index of xts3
> indexClass(xts2) Get index class > start(xts3) Extract first observation of xts3
> indexClass(convertIndex(xts,'POSIXct')) Replacing index class > to.period(xts5,period="quarters") Convert to quarterly OHLC > end(xts4) Extract last observation of xts4
> indexTZ(xts5) Get the index time zone > to.period(xts5,period="years") Convert to yearly OHLC > str(xts3) Display structure of xts3
> indexFormat(xts5) <- "%Y-%m-%d" Change format of time display > nmonths(xts5) Count the months in xts5 > time(xts1) Extract raw numeric index of xts1
> nquarters(xts5) Count the quarters in xts5 > head(xts2) First part of xts2
Time Zones > nyears(xts5) Count the years in xts5 > tail(xts2) Last part of xts2
> make.index.unique(xts3,eps=1e-4) Make index unique
> tzone(xts1) <- "Asia/Hong_Kong" Change the time zone > make.index.unique(xts3,drop=TRUE) Remove duplicate times
> tzone(xts1) Extract the current time zone > align.time(xts3,n=3600) Round index time to the next n seconds DataCamp
Learn R for Data Science Interactively
Python For Data Science Cheat Sheet Model Architecture Inspect Model
>>> model.output_shape
Sequential Model Model output shape
Keras >>> from keras.models import Sequential
>>>
>>>
model.summary()
model.get_config()
Model summary representation
Model configuration
Learn Python for data science Interactively at www.DataCamp.com >>> model = Sequential() >>> model.get_weights() List all weight tensors in the model
>>> model2 = Sequential()
>>> model3 = Sequential() Compile Model
Multilayer Perceptron (MLP) MLP: Binary Classification
Keras Binary Classification >>> model.compile(optimizer='adam',
loss='binary_crossentropy',
Keras is a powerful and easy-to-use deep learning library for >>> from keras.layers import Dense metrics=['accuracy'])
Theano and TensorFlow that provides a high-level neural >>> model.add(Dense(12, MLP: Multi-Class Classification
input_dim=8, >>> model.compile(optimizer='rmsprop',
networks API to develop and evaluate deep learning models. kernel_initializer='uniform', loss='categorical_crossentropy',
activation='relu')) metrics=['accuracy'])
A Basic Example >>> model.add(Dense(8,kernel_initializer='uniform',activation='relu'))
MLP: Regression
>>> model.add(Dense(1,kernel_initializer='uniform',activation='sigmoid')) >>> model.compile(optimizer='rmsprop',
>>> import numpy as np loss='mse',
>>> from keras.models import Sequential Multi-Class Classification metrics=['mae'])
>>> from keras.layers import Dense >>> from keras.layers import Dropout
>>> data = np.random.random((1000,100)) >>> model.add(Dense(512,activation='relu',input_shape=(784,))) Recurrent Neural Network
>>> labels = np.random.randint(2,size=(1000,1)) >>> model.add(Dropout(0.2)) >>> model3.compile(loss='binary_crossentropy',
>>> model = Sequential() optimizer='adam',
>>> model.add(Dense(512,activation='relu')) metrics=['accuracy'])
>>> model.add(Dense(32, >>> model.add(Dropout(0.2))
activation='relu', >>> model.add(Dense(10,activation='softmax'))

>>>
input_dim=100))
model.add(Dense(1, activation='sigmoid'))
Regression Model Training
>>> model.compile(optimizer='rmsprop', >>> model.add(Dense(64,activation='relu',input_dim=train_data.shape[1])) >>> model3.fit(x_train4,
loss='binary_crossentropy', >>> model.add(Dense(1)) y_train4,
metrics=['accuracy']) batch_size=32,
>>> model.fit(data,labels,epochs=10,batch_size=32) Convolutional Neural Network (CNN) epochs=15,
verbose=1,
>>> predictions = model.predict(data) >>> from keras.layers import Activation,Conv2D,MaxPooling2D,Flatten validation_data=(x_test4,y_test4))
>>> model2.add(Conv2D(32,(3,3),padding='same',input_shape=x_train.shape[1:]))
Data Also see NumPy, Pandas & Scikit-Learn >>>
>>>
model2.add(Activation('relu'))
model2.add(Conv2D(32,(3,3))) Evaluate Your Model's Performance
Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ide- >>> model2.add(Activation('relu')) >>> score = model3.evaluate(x_test,
>>> model2.add(MaxPooling2D(pool_size=(2,2))) y_test,
ally, you split the data in training and test sets, for which you can also resort batch_size=32)
>>> model2.add(Dropout(0.25))
to the train_test_split function of sklearn.model_selection.
>>> model2.add(Conv2D(64,(3,3), padding='same'))
Keras Data Sets >>>
>>>
model2.add(Activation('relu'))
model2.add(Conv2D(64,(3, 3)))
Prediction
>>> from keras.datasets import boston_housing, >>> model2.add(Activation('relu')) >>> model3.predict(x_test4, batch_size=32)
mnist, >>> model2.add(MaxPooling2D(pool_size=(2,2))) >>> model3.predict_classes(x_test4,batch_size=32)
cifar10, >>> model2.add(Dropout(0.25))
imdb
>>> (x_train,y_train),(x_test,y_test) = mnist.load_data()
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
>>>
>>>
model2.add(Flatten())
model2.add(Dense(512))
Save/ Reload Models
>>> (x_train3,y_train3),(x_test3,y_test3) = cifar10.load_data() >>> model2.add(Activation('relu')) >>> from keras.models import load_model
>>> (x_train4,y_train4),(x_test4,y_test4) = imdb.load_data(num_words=20000) >>> model2.add(Dropout(0.5)) >>> model3.save('model_file.h5')
>>> num_classes = 10 >>> my_model = load_model('model_file.h5')
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))
Other
Recurrent Neural Network (RNN) Model Fine-tuning
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/
ml/machine-learning-databases/pima-indians-diabetes/
>>> from keras.layers import Embedding,LSTM Optimization Parameters
pima-indians-diabetes.data"),delimiter=",") >>> model3.add(Embedding(20000,128)) >>> from keras.optimizers import RMSprop
>>> X = data[:,0:8] >>> model3.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2)) >>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> y = data [:,8] >>> model3.add(Dense(1,activation='sigmoid')) >>> model2.compile(loss='categorical_crossentropy',
optimizer=opt,
metrics=['accuracy'])
Preprocessing Also see NumPy & Scikit-Learn
Early Stopping
Sequence Padding Train and Test Sets >>> from keras.callbacks import EarlyStopping
>>> from keras.preprocessing import sequence >>> from sklearn.model_selection import train_test_split >>> early_stopping_monitor = EarlyStopping(patience=2)
>>> x_train4 = sequence.pad_sequences(x_train4,maxlen=80) >>> X_train5,X_test5,y_train5,y_test5 = train_test_split(X, >>> model3.fit(x_train4,
>>> x_test4 = sequence.pad_sequences(x_test4,maxlen=80) y,
test_size=0.33, y_train4,
random_state=42) batch_size=32,
One-Hot Encoding epochs=15,
>>> from keras.utils import to_categorical Standardization/Normalization validation_data=(x_test4,y_test4),
>>> Y_train = to_categorical(y_train, num_classes) >>> from sklearn.preprocessing import StandardScaler callbacks=[early_stopping_monitor])
>>> Y_test = to_categorical(y_test, num_classes) >>> scaler = StandardScaler().fit(x_train2)
>>> Y_train3 = to_categorical(y_train3, num_classes) >>> standardized_X = scaler.transform(x_train2) DataCamp
>>> Y_test3 = to_categorical(y_test3, num_classes) >>> standardized_X_test = scaler.transform(x_test2) Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Retrieving RDD Information Reshaping Data
Basic Information Reducing
PySpark - RDD Basics >>> rdd.getNumPartitions() List the number of partitions
>>> rdd.reduceByKey(lambda x,y : x+y)
.collect()
Merge the rdd values for
each key
Learn Python for data science Interactively at www.DataCamp.com >>> rdd.count() Count RDD instances [('a',9),('b',2)]
3 >>> rdd.reduce(lambda a, b: a + b) Merge the rdd values
>>> rdd.countByKey() Count RDD instances by key ('a',7,'a',2,'b',2)
defaultdict(<type 'int'>,{'a':2,'b':1}) Grouping by
>>> rdd.countByValue() Count RDD instances by value >>> rdd3.groupBy(lambda x: x % 2) Return RDD of grouped values
Spark defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() Return (key,value) pairs as a
.mapValues(list)
.collect()
PySpark is the Spark Python API that exposes {'a': 2,'b': 2} dictionary >>> rdd.groupByKey() Group rdd by key
>>> rdd3.sum() Sum of RDD elements .mapValues(list)
the Spark programming model to Python. 4950 .collect()
>>> sc.parallelize([]).isEmpty() Check whether RDD is empty [('a',[7,2]),('b',[2])]
True
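A minimal end-to-end sketch of that programming model (an illustrative addition, not from the original sheet; it assumes a local Spark installation):
>>> from pyspark import SparkContext
>>> sc = SparkContext(master='local[2]')
>>> squares = sc.parallelize([1, 2, 3, 4]).map(lambda x: x*x)   # lazy transformation on a distributed collection
>>> squares.collect()                                           # action that triggers the computation
[1, 4, 9, 16]
>>> sc.stop()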
Initializing Spark Summary
Aggregating
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
SparkContext >>> rdd3.max() Maximum value of RDD elements >>> rdd3.aggregate((0,0),seqOp,combOp) Aggregate RDD elements of each
99 (4950,100) partition and then the results
>>> from pyspark import SparkContext >>> rdd3.min() Minimum value of RDD elements
>>> sc = SparkContext(master = 'local[2]') >>> rdd.aggregateByKey((0,0),seqop,combop) Aggregate values of each RDD key
0
>>> rdd3.mean() Mean value of RDD elements .collect()
Inspect SparkContext 49.5 [('a',(9,2)), ('b',(2,1))]
>>> rdd3.stdev() Standard deviation of RDD elements >>> rdd3.fold(0,add) Aggregate the elements of each
>>> sc.version Retrieve SparkContext version 28.866070047722118 4950 partition, and then the results
>>> sc.pythonVer Retrieve Python version >>> rdd3.variance() Compute variance of RDD elements >>> rdd.foldByKey(0, add) Merge the values for each key
>>> sc.master Master URL to connect to 833.25 .collect()
>>> str(sc.sparkHome) Path where Spark is installed on worker nodes >>> rdd3.histogram(3) Compute histogram by bins [('a',9),('b',2)]
>>> str(sc.sparkUser()) Retrieve name of the Spark User running ([0,33,66,99],[33,33,34])
>>> rdd3.stats() Summary statistics (count, mean, stdev, max & >>> rdd3.keyBy(lambda x: x+x) Create tuples of RDD elements by
SparkContext
>>> sc.appName Return application name min) .collect() applying a function
>>> sc.applicationId Retrieve application ID
>>>
>>>
sc.defaultParallelism Return default level of parallelism
sc.defaultMinPartitions Default minimum number of partitions for Applying Functions Mathematical Operations
RDDs >>> rdd.map(lambda x: x+(x[1],x[0])) Apply a function to each RDD element >>> rdd.subtract(rdd2) Return each rdd value not contained
.collect() .collect() in rdd2
Configuration [('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')] [('b',2),('a',7)]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0])) Apply a function to each RDD element >>> rdd2.subtractByKey(rdd) Return each (key,value) pair of rdd2
>>> from pyspark import SparkConf, SparkContext and flatten the result .collect() with no matching key in rdd
>>> conf = (SparkConf() >>> rdd5.collect() [('d', 1)]
.setMaster("local") ['a',7,7,'a','a',2,2,'a','b',2,2,'b'] >>> rdd.cartesian(rdd2).collect() Return the Cartesian product of rdd
.setAppName("My app") >>> rdd4.flatMapValues(lambda x: x) Apply a flatMap function to each (key,value) and rdd2
.set("spark.executor.memory", "1g")) .collect() pair of rdd4 without changing the keys
>>> sc = SparkContext(conf = conf) [('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]
Sort
Using The Shell Selecting Data >>> rdd2.sortBy(lambda x: x[1]) Sort RDD by given function
.collect()
In the PySpark shell, a special interpreter-aware SparkContext is already Getting [('d',1),('b',1),('a',2)]
created in the variable called sc. >>> rdd.collect() Return a list with all RDD elements >>> rdd2.sortByKey() Sort (key, value) RDD by key
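For example (a small sketch assuming you are inside the PySpark shell), the pre-created sc can be used directly, without constructing a SparkContext yourself:
>>> sc.parallelize(range(10)).sum()   # sc already exists in the shell
45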
[('a', 7), ('a', 2), ('b', 2)] .collect()
$ ./bin/spark-shell --master local[2] >>> rdd.take(2) Take first 2 RDD elements [('a',2),('b',1),('d',1)]
$ ./bin/pyspark --master local[4] --py-files code.py [('a', 7), ('a', 2)]
>>> rdd.first() Take first RDD element
Set which master the context connects to with the --master argument, and
('a', 7) Repartitioning
add Python .zip, .egg or .py files to the runtime path by passing a >>> rdd.top(2) Take top 2 RDD elements
[('b', 2), ('a', 7)] >>> rdd.repartition(4) New RDD with 4 partitions
comma-separated list to --py-files. >>> rdd.coalesce(1) Decrease the number of partitions in the RDD to 1
Sampling
Loading Data >>> rdd3.sample(False, 0.15, 81).collect() Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97] Saving
Parallelized Collections Filtering >>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.filter(lambda x: "a" in x) Filter the RDD >>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)]) .collect() 'org.apache.hadoop.mapred.TextOutputFormat')
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)]) [('a',7),('a',2)]
>>> rdd3 = sc.parallelize(range(100)) >>> rdd5.distinct().collect() Return distinct RDD values
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
("b",["p", "r"])])
['a',2,'b',7]
>>> rdd.keys().collect() Return (key,value) RDD's keys
Stopping SparkContext
['a', 'a', 'b'] >>> sc.stop()
External Data
Read either one text file from HDFS, a local file system or any Iterating Execution
Hadoop-supported file system URI with textFile(), or read in a directory >>> def g(x): print(x)
>>> rdd.foreach(g) Apply a function to all RDD elements $ ./bin/spark-submit examples/src/main/python/pi.py
of text files with wholeTextFiles().
('a', 7)
>>> textFile = sc.textFile("/my/directory/*.txt") ('b', 2) DataCamp
>>> textFile2 = sc.wholeTextFiles("/my/directory/") ('a', 2) Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Advanced Indexing Also see NumPy Arrays Combining Data
Selecting data1 data2
Pandas >>> df3.loc[:,(df3>1).any()] Select cols with any vals >1 X1 X2 X1 X3
Learn Python for Data Science Interactively at www.DataCamp.com >>> df3.loc[:,(df3>1).all()] Select cols with vals > 1
>>> df3.loc[:,df3.isnull().any()] Select cols with NaN a 11.432 a 20.784
>>> df3.loc[:,df3.notnull().all()] Select cols without NaN b 1.303 b NaN
Indexing With isin c 99.906 d 20.784
>>> df[(df.Country.isin(df2.Type))] Find same elements
Reshaping Data >>> df3.filter(items=['a','b']) Filter on column labels
Merge
>>> df.select(lambda x: not x%5) Select specific elements
Pivot Where X1 X2 X3
>>> pd.merge(data1,
>>> df3= df2.pivot(index='Date', Spread rows into columns >>> s.where(s > 0) Subset the data data2, a 11.432 20.784
columns='Type', Query how='left',
values='Value') b 1.303 NaN
>>> df6.query('second > first') Query DataFrame on='X1')
c 99.906 NaN
Date Type Value

0 2016-03-01 a 11.432 Type a b c Setting/Resetting Index >>> pd.merge(data1, X1 X2 X3


1 2016-03-02 b 13.031 Date data2, a 11.432 20.784
>>> df.set_index('Country') Set the index
how='right',
2 2016-03-01 c 20.784 2016-03-01 11.432 NaN 20.784 >>> df4 = df.reset_index() Reset the index b 1.303 NaN
on='X1')
3 2016-03-03 a 99.906 >>> df = df.rename(index=str, Rename DataFrame d NaN 20.784
2016-03-02 1.303 13.031 NaN columns={"Country":"cntry",
4 2016-03-02 a 1.303 "Capital":"cptl", >>> pd.merge(data1,
2016-03-03 99.906 NaN 20.784 "Population":"ppltn"}) X1 X2 X3
5 2016-03-03 c 20.784 data2,
how='inner', a 11.432 20.784
Pivot Table Reindexing on='X1') b 1.303 NaN
>>> s2 = s.reindex(['a','c','d','e','b'])
>>> df4 = pd.pivot_table(df2, Spread rows into columns X1 X2 X3
values='Value', Forward Filling Backward Filling >>> pd.merge(data1,
index='Date', data2, a 11.432 20.784
columns='Type') >>> df.reindex(range(4), >>> s3 = s.reindex(range(5), how='outer',
method='ffill') method='bfill') on='X1') c 99.906 NaN
Stack / Unstack Country Capital Population 0 3
0 Belgium Brussels 11190846 1 3 d NaN 20.784
>>> stacked = df5.stack() Pivot a level of column labels 1 India New Delhi 1303171035 2 3
>>> stacked.unstack() Pivot a level of index labels 2 Brazil Brasília 207847528 3 3 Join
3 Brazil Brasília 207847528 4 3
0 1 1 5 0 0.233482 >>> data1.join(data2, how='right')
1 5 0.233482 0.390959 1 0.390959 MultiIndexing Concatenate
2 4 0.184713 0.237102 2 4 0 0.184713
>>> arrays = [np.array([1,2,3]),
3 3 0.433522 0.429401 1 0.237102 np.array([5,4,3])] Vertical
>>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays) >>> s.append(s2)
Unstacked 3 3 0 0.433522
>>> tuples = list(zip(*arrays)) Horizontal/Vertical
1 0.429401 >>> index = pd.MultiIndex.from_tuples(tuples, >>> pd.concat([s,s2],axis=1, keys=['One','Two'])
Stacked names=['first', 'second']) >>> pd.concat([data1, data2], axis=1, join='inner')
>>> df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
Melt >>> df2.set_index(["Date", "Type"])
>>> pd.melt(df2, Gather columns into rows
Dates
id_vars=["Date"],
value_vars=["Type", "Value"],
Duplicate Data >>> df2['Date']= pd.to_datetime(df2['Date'])
>>> df2['Date']= pd.date_range('2000-1-1',
value_name="Observations") >>> s3.unique() Return unique values periods=6,
>>> df2.duplicated('Type') Check duplicates freq='M')
Date Type Value
Date Variable Observations >>> dates = [datetime(2012,5,1), datetime(2012,5,2)]
0 2016-03-01 Type a >>> df2.drop_duplicates('Type', keep='last') Drop duplicates >>> index = pd.DatetimeIndex(dates)
0 2016-03-01 a 11.432 1 2016-03-02 Type b
>>> df.index.duplicated() Check index duplicates >>> index = pd.date_range(datetime(2012,2,1), end, freq='BM')
1 2016-03-02 b 13.031 2 2016-03-01 Type c
2 2016-03-01 c 20.784 3 2016-03-03 Type a Grouping Data Visualization Also see Matplotlib
4 2016-03-02 Type a
3 2016-03-03 a 99.906
5 2016-03-03 Type c Aggregation >>> import matplotlib.pyplot as plt
4 2016-03-02 a 1.303 >>> df2.groupby(by=['Date','Type']).mean()
6 2016-03-01 Value 11.432 >>> s.plot() >>> df2.plot()
>>> df4.groupby(level=0).sum()
5 2016-03-03 c 20.784 7 2016-03-02 Value 13.031 >>> df4.groupby(level=0).agg({'a':lambda x:sum(x)/len(x), >>> plt.show() >>> plt.show()
8 2016-03-01 Value 20.784 'b': np.sum})
9 2016-03-03 Value 99.906 Transformation
>>> customSum = lambda x: (x+x%2)
10 2016-03-02 Value 1.303
>>> df4.groupby(level=0).transform(customSum)
11 2016-03-03 Value 20.784

Iteration Missing Data


>>> df.dropna() Drop NaN values
>>> df.iteritems() (Column-index, Series) pairs >>> df3.fillna(df3.mean()) Fill NaN values with a predetermined value
>>> df.iterrows() (Row-index, Series) pairs >>> df2.replace("a", "f") Replace values with others
DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Linear Algebra Also see NumPy
You’ll use the linalg and sparse modules. Note that scipy.linalg contains and expands on numpy.linalg.
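As a small illustrative sketch (added here, not part of the original sheet), scipy.linalg provides routines such as the matrix exponential that numpy.linalg does not:
>>> import numpy as np
>>> from scipy import linalg
>>> A = np.array([[0., 1.], [0., 0.]])
>>> linalg.expm(A)            # matrix exponential; no equivalent in numpy.linalg
array([[1., 1.],
       [0., 1.]])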
SciPy - Linear Algebra >>> from scipy import linalg, sparse Matrix Functions
Learn More Python for Data Science Interactively at www.datacamp.com
Creating Matrices Addition
>>> np.add(A,D) Addition
>>> A = np.matrix(np.random.random((2,2)))
SciPy >>> B = np.asmatrix(b) Subtraction
>>> C = np.mat(np.random.random((10,5))) >>> np.subtract(A,D) Subtraction
The SciPy library is one of the core packages for >>> D = np.mat([[3,4], [5,6]]) Division
scientific computing that provides mathematical >>> np.divide(A,D) Division
Basic Matrix Routines Multiplication
algorithms and convenience functions built on the
>>> np.multiply(D,A) Multiplication
NumPy extension of Python. Inverse >>> np.dot(A,D) Dot product
>>> A.I Inverse >>> np.vdot(A,D) Vector dot product
>>> linalg.inv(A) Inverse
Interacting With NumPy Also see NumPy >>> A.T Tranpose matrix >>> np.inner(A,D) Inner product
>>> np.outer(A,D) Outer product
>>> import numpy as np >>> A.H Conjugate transposition >>> np.tensordot(A,D) Tensor dot product
>>> a = np.array([1,2,3]) >>> np.trace(A) Trace >>> np.kron(A,D) Kronecker product
>>> b = np.array([(1+5j,2j,3j), (4j,5j,6j)])
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]]) Norm Exponential Functions
>>> linalg.norm(A) Frobenius norm >>> linalg.expm(A) Matrix exponential
Index Tricks >>> linalg.norm(A,1) L1 norm (max column sum) >>> linalg.expm2(A) Matrix exponential (Taylor Series)
>>> linalg.norm(A,np.inf) L inf norm (max row sum) >>> linalg.expm3(D) Matrix exponential (eigenvalue
>>> np.mgrid[0:5,0:5] Create a dense meshgrid decomposition)
>>> np.ogrid[0:2,0:2] Create an open meshgrid Rank Logarithm Function
>>> np.r_[3,[0]*5,-1:1:10j] Stack arrays vertically (row-wise) >>> np.linalg.matrix_rank(C) Matrix rank >>> linalg.logm(A) Matrix logarithm
>>> np.c_[b,c] Create stacked column-wise arrays Determinant Trigonometric Tunctions
>>> linalg.det(A) Determinant >>> linalg.sinm(D) Matrix sine
Shape Manipulation Solving linear problems >>> linalg.cosm(D) Matrix cosine
>>> np.transpose(b) Permute array dimensions >>> linalg.solve(A,b) Solver for dense matrices >>> linalg.tanm(A) Matrix tangent
>>> b.flatten() Flatten the array >>> E = np.mat(a).T Create a column matrix (used with lstsq below) Hyperbolic Trigonometric Functions
>>> np.hstack((b,c)) Stack arrays horizontally (column-wise) >>> linalg.lstsq(D,E) Least-squares solution to linear matrix >>> linalg.sinhm(D) Hyperbolic matrix sine
>>> np.vstack((a,b)) Stack arrays vertically (row-wise) equation >>> linalg.coshm(D) Hyperbolic matrix cosine
>>> np.hsplit(c,2) Split the array horizontally at the 2nd index Generalized inverse >>> linalg.tanhm(A) Hyperbolic matrix tangent
>>> np.vsplit(d,2) Split the array vertically at the 2nd index Generalized inverse >>> linalg.tanhm(A) Hyperbolic matrix tangent
(least-squares solver) >>> linalg.signm(A) Matrix sign function
Polynomials >>> linalg.pinv2(C) Compute the pseudo-inverse of a matrix
>>> from numpy import poly1d (SVD) Matrix Square Root
>>> linalg.sqrtm(A) Matrix square root
>>> p = poly1d([3,4,5]) Create a polynomial object
Creating Sparse Matrices Arbitrary Functions
Vectorizing Functions >>> linalg.funm(A, lambda x: x*x) Evaluate matrix function
>>> F = np.eye(3, k=1) Create a 3x3 matrix with ones on the first superdiagonal (k=1)
>>> def myfunc(a):
if a < 0: >>> G = np.mat(np.identity(2)) Create a 2x2 identity matrix Decompositions
return a*2 >>> C[C > 0.5] = 0
else: >>> H = sparse.csr_matrix(C)
return a/2
Compressed Sparse Row matrix Eigenvalues and Eigenvectors
>>> I = sparse.csc_matrix(D) Compressed Sparse Column matrix >>> la, v = linalg.eig(A) Solve ordinary or generalized
>>> np.vectorize(myfunc) Vectorize functions >>> J = sparse.dok_matrix(A) Dictionary Of Keys matrix eigenvalue problem for square matrix
>>> E.todense() Sparse matrix to full matrix >>> l1, l2 = la Unpack eigenvalues
Type Handling >>> sparse.isspmatrix_csc(A) Identify sparse matrix >>> v[:,0] First eigenvector
>>> v[:,1] Second eigenvector
>>> np.real(c) Return the real part of the array elements
>>> np.imag(c) Return the imaginary part of the array elements Sparse Matrix Routines >>> linalg.eigvals(A) Unpack eigenvalues
>>> np.real_if_close(c,tol=1000) Return a real array if complex parts close to 0 Singular Value Decomposition
>>> np.cast['f'](np.pi) Cast object to a data type Inverse >>> U,s,Vh = linalg.svd(B) Singular Value Decomposition (SVD)
>>> sparse.linalg.inv(I) Inverse >>> M,N = B.shape
Other Useful Functions Norm >>> Sig = linalg.diagsvd(s,M,N) Construct sigma matrix in SVD
>>> sparse.linalg.norm(I) Norm LU Decomposition
>>> np.angle(b,deg=True) Return the angle of the complex argument >>> P,L,U = linalg.lu(C) LU Decomposition
>>> g = np.linspace(0,np.pi,num=5) Create an array of evenly spaced values
Solving linear problems
(number of samples) >>> sparse.linalg.spsolve(H,I) Solver for sparse matrices
>>> g [3:] += np.pi
>>> np.unwrap(g) Unwrap Sparse Matrix Decompositions
>>> np.logspace(0,10,3) Create an array of evenly spaced values (log scale) Sparse Matrix Functions
>>> la, v = sparse.linalg.eigs(F,1) Eigenvalues and eigenvectors
>>> np.select([c<4],[c*2]) Return values from a list of arrays depending on >>> sparse.linalg.expm(I) Sparse matrix exponential >>> sparse.linalg.svds(H, 2) SVD
conditions
>>> misc.factorial(a)                Factorial
>>> misc.comb(10,3,exact=True)       Combine N things taken at k time
>>> misc.central_diff_weights(3)     Weights for Np-point central derivative
>>> misc.derivative(myfunc,1.0)      Find the n-th derivative of a function at a point

Asking For Help
>>> help(scipy.linalg.diagsvd)
>>> np.info(np.matrix)

DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Inspecting Your Array Subsetting, Slicing, Indexing Also see Lists
>>> a.shape Array dimensions Subsetting
NumPy Basics >>>
>>>
len(a)
b.ndim
Length of array
Number of array dimensions
>>> a[2]
3
1 2 3 Select the element at the 2nd index
Learn Python for Data Science Interactively at www.DataCamp.com >>> e.size Number of array elements >>> b[1,2] 1.5 2 3 Select the element at row 1 column 2
>>> b.dtype Data type of array elements 6.0 4 5 6 (equivalent to b[1][2])
>>> b.dtype.name Name of data type
>>> b.astype(int) Convert an array to a different type Slicing
NumPy >>> a[0:2]
array([1, 2])
1 2 3 Select items at index 0 and 1
2
The NumPy library is the core library for scientific computing in Asking For Help >>> b[0:2,1] 1.5 2 3 Select items at rows 0 and 1 in column 1
>>> np.info(np.ndarray.dtype) array([ 2., 5.]) 4 5 6
Python. It provides a high-performance multidimensional array
Array Mathematics
1.5 2 3
>>> b[:1] Select all items at row 0
object, and tools for working with these arrays. array([[1.5, 2., 3.]]) 4 5 6 (equivalent to b[0:1, :])
Arithmetic Operations >>> c[1,...] Same as [1,:,:]
Use the following import convention: array([[[ 3., 2., 1.],
>>> import numpy as np [ 4., 5., 6.]]])
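As a quick illustrative sketch (not from the original sheet), operations on the array object are vectorized and apply element-wise, with no explicit Python loop:
>>> v = np.array([1, 2, 3, 4])   # a small example array
>>> v * 2 + 1                    # one vectorized expression, evaluated element-wise
array([3, 5, 7, 9])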
>>> g = a - b Subtraction
array([[-0.5, 0. , 0. ], >>> a[ : :-1] Reversed array a
NumPy Arrays [-3. , -3. , -3. ]])
array([3, 2, 1])

>>> np.subtract(a,b) Boolean Indexing


1D array 2D array 3D array Subtraction
>>> a[a<2] Select elements from a less than 2
>>> b + a Addition 1 2 3
array([[ 2.5, 4. , 6. ], array([1])
axis 1 axis 2
1 2 3 axis 1 [ 5. , 7. , 9. ]]) Fancy Indexing
1.5 2 3 >>> np.add(b,a) Addition >>> b[[1, 0, 1, 0],[0, 1, 2, 0]] Select elements (1,0),(0,1),(1,2) and (0,0)
axis 0 axis 0 array([ 4. , 2. , 6. , 1.5])
4 5 6 >>> a / b Division
array([[ 0.66666667, 1. , 1. ], >>> b[[1, 0, 1, 0]][:,[0,1,2,0]] Select a subset of the matrix’s rows
[ 0.25 , 0.4 , 0.5 ]]) array([[ 4. ,5. , 6. , 4. ], and columns
>>> np.divide(a,b) Division [ 1.5, 2. , 3. , 1.5],
Creating Arrays >>> a * b
array([[ 1.5, 4. , 9. ],
Multiplication
[ 4. , 5.
[ 1.5, 2.
,
,
6.
3.
,
,
4. ],
1.5]])

>>> a = np.array([1,2,3]) [ 4. , 10. , 18. ]])


>>> b = np.array([(1.5,2,3), (4,5,6)], dtype = float) >>> np.multiply(a,b) Multiplication Array Manipulation
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]], >>> np.exp(b) Exponentiation
dtype = float) >>> np.sqrt(b) Square root Transposing Array
>>> np.sin(a) Print sines of an array >>> i = np.transpose(b) Permute array dimensions
Initial Placeholders >>> np.cos(b) Element-wise cosine >>> i.T Permute array dimensions
>>> np.log(a) Element-wise natural logarithm
>>> np.zeros((3,4)) Create an array of zeros >>> e.dot(f) Dot product
Changing Array Shape
>>> np.ones((2,3,4),dtype=np.int16) Create an array of ones array([[ 7., 7.], >>> b.ravel() Flatten the array
>>> d = np.arange(10,25,5) Create an array of evenly [ 7., 7.]]) >>> g.reshape(3,-2) Reshape, but don’t change data
spaced values (step value)
>>> np.linspace(0,2,9) Create an array of evenly Comparison Adding/Removing Elements
spaced values (number of samples) >>> h.resize((2,6)) Resize h in place to shape (2,6)
>>> e = np.full((2,2),7) Create a constant array >>> a == b Element-wise comparison >>> np.append(h,g) Append items to an array
>>> f = np.eye(2) Create a 2X2 identity matrix array([[False, True, True], >>> np.insert(a, 1, 5) Insert items in an array
>>> np.random.random((2,2)) Create an array with random values [False, False, False]], dtype=bool) >>> np.delete(a,[1]) Delete items from an array
>>> np.empty((3,2)) Create an empty array >>> a < 2 Element-wise comparison
array([True, False, False], dtype=bool) Combining Arrays
>>> np.array_equal(a, b) Array-wise comparison >>> np.concatenate((a,d),axis=0) Concatenate arrays
I/O array([ 1, 2,
>>> np.vstack((a,b))
3, 10, 15, 20])
Stack arrays vertically (row-wise)
Aggregate Functions array([[ 1. , 2. , 3. ],
Saving & Loading On Disk [ 1.5, 2. , 3. ],
>>> a.sum() Array-wise sum [ 4. , 5. , 6. ]])
>>> np.save('my_array', a) >>> a.min() Array-wise minimum value >>> np.r_[e,f] Stack arrays vertically (row-wise)
>>> np.savez('array.npz', a, b) >>> b.max(axis=0) Maximum value of an array row >>> np.hstack((e,f)) Stack arrays horizontally (column-wise)
>>> np.load('my_array.npy') >>> b.cumsum(axis=1) Cumulative sum of the elements array([[ 7., 7., 1., 0.],
>>> a.mean() Mean [ 7., 7., 0., 1.]])
Saving & Loading Text Files >>> np.median(b) Median >>> np.column_stack((a,d)) Create stacked column-wise arrays
>>> np.loadtxt("myfile.txt") >>> np.corrcoef(a) Correlation coefficient array([[ 1, 10],
>>> np.std(b) Standard deviation [ 2, 15],
>>> np.genfromtxt("my_file.csv", delimiter=',') [ 3, 20]])
>>> np.savetxt("myarray.txt", a, delimiter=" ") >>> np.c_[a,d] Create stacked column-wise arrays
Copying Arrays Splitting Arrays
Data Types >>> h = a.view() Create a view of the array with the same data >>> np.hsplit(a,3) Split the array horizontally at the 3rd
>>> np.copy(a) Create a copy of the array [array([1]),array([2]),array([3])] index
>>> np.int64 Signed 64-bit integer types >>> np.vsplit(c,2) Split the array vertically at the 2nd index
>>> np.float32 Standard single-precision floating point >>> h = a.copy() Create a deep copy of the array
>>> np.complex Complex numbers represented by two 64-bit floats array([[[ 3., 2., 3.],
array([[[ 3., 2., 3.],
>>>
>>>
np.bool
np.object
Boolean type storing TRUE and FALSE values
Python object type Sorting Arrays [ 4., 5., 6.]]])]

>>> np.string_ Fixed-length string type >>> a.sort() Sort an array


>>> np.unicode_ Fixed-length unicode type >>> c.sort(axis=0) Sort the elements of an array's axis DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Create Your Model Evaluate Your Model’s Performance
Supervised Learning Estimators Classification Metrics
Scikit-Learn
Learn Python for data science Interactively at www.DataCamp.com Linear Regression Accuracy Score
>>> from sklearn.linear_model import LinearRegression >>> knn.score(X_test, y_test) Estimator score method
>>> lr = LinearRegression(normalize=True) >>> from sklearn.metrics import accuracy_score Metric scoring functions
>>> accuracy_score(y_test, y_pred)
Support Vector Machines (SVM)
Scikit-learn >>> from sklearn.svm import SVC Classification Report
>>> svc = SVC(kernel='linear') >>> from sklearn.metrics import classification_report Precision, recall, f1-score
Scikit-learn is an open source Python library that Naive Bayes >>> print(classification_report(y_test, y_pred)) and support
implements a range of machine learning, >>> from sklearn.naive_bayes import GaussianNB Confusion Matrix
>>> gnb = GaussianNB() >>> from sklearn.metrics import confusion_matrix
preprocessing, cross-validation and visualization >>> print(confusion_matrix(y_test, y_pred))
algorithms using a unified interface. KNN
>>> from sklearn import neighbors Regression Metrics
A Basic Example >>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> from sklearn import neighbors, datasets, preprocessing
Mean Absolute Error
>>> from sklearn.model_selection import train_test_split Unsupervised Learning Estimators >>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.metrics import accuracy_score >>> y_true = [3, -0.5, 2]
>>> iris = datasets.load_iris() Principal Component Analysis (PCA) >>> mean_absolute_error(y_true, y_pred)
>>> X, y = iris.data[:, :2], iris.target >>> from sklearn.decomposition import PCA Mean Squared Error
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33) >>> pca = PCA(n_components=0.95) >>> from sklearn.metrics import mean_squared_error
>>> scaler = preprocessing.StandardScaler().fit(X_train) >>> mean_squared_error(y_test, y_pred)
>>> X_train = scaler.transform(X_train)
K Means
>>> X_test = scaler.transform(X_test) >>> from sklearn.cluster import KMeans R² Score
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5) >>> k_means = KMeans(n_clusters=3, random_state=0) >>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred) Model Fitting Clustering Metrics
Adjusted Rand Index
Supervised learning >>> from sklearn.metrics import adjusted_rand_score
Loading The Data Also see NumPy & Pandas >>> lr.fit(X, y) Fit the model to the data
>>> adjusted_rand_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse >>> svc.fit(X_train, y_train) Homogeneity
>>> from sklearn.metrics import homogeneity_score
matrices. Other types that are convertible to numeric arrays, such as Pandas Unsupervised Learning >>> homogeneity_score(y_true, y_pred)
DataFrame, are also acceptable. >>> k_means.fit(X_train) Fit the model to the data
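A small sketch of that last point (illustrative names, not from the original sheet): a numeric DataFrame can be passed to an estimator directly.
>>> import pandas as pd
>>> from sklearn import neighbors
>>> df = pd.DataFrame({'x1': [0., 1., 2., 3.], 'x2': [1., 0., 1., 0.]})
>>> labels = [0, 0, 1, 1]
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=1).fit(df, labels)  # DataFrame accepted as X
>>> knn.predict(df.iloc[:2])
array([0, 0])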
>>> pca_model = pca.fit_transform(X_train) Fit to data, then transform it V-measure
>>> import numpy as np >>> from sklearn.metrics import v_measure_score
>>> X = np.random.random((10,5)) >>> v_measure_score(y_true, y_pred)
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> X[X < 0.7] = 0 Prediction Cross-Validation
>>> from sklearn.model_selection import cross_val_score
Supervised Estimators >>> print(cross_val_score(knn, X_train, y_train, cv=4))
Training And Test Data >>> y_pred = svc.predict(np.random.random((2,5))) Predict labels
>>> y_pred = lr.predict(X_test)
>>> print(cross_val_score(lr, X, y, cv=2))
Predict labels
>>> from sklearn.model_selection import train_test_split >>> y_pred = knn.predict_proba(X_test) Estimate probability of a label
>>> X_train, X_test, y_train, y_test = train_test_split(X,
y, Unsupervised Estimators Tune Your Model
random_state=0) >>> y_pred = k_means.predict(X_test) Predict labels in clustering algos Grid Search
>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3),
Preprocessing The Data "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,
Standardization Encoding Categorical Features param_grid=params)
>>> grid.fit(X_train, y_train)
>>> from sklearn.preprocessing import StandardScaler >>> from sklearn.preprocessing import LabelEncoder >>> print(grid.best_score_)
>>> scaler = StandardScaler().fit(X_train) >>> print(grid.best_estimator_.n_neighbors)
>>> enc = LabelEncoder()
>>> standardized_X = scaler.transform(X_train) >>> y = enc.fit_transform(y)
>>> standardized_X_test = scaler.transform(X_test) Randomized Parameter Optimization
Normalization Imputing Missing Values >>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
>>> from sklearn.preprocessing import Normalizer "weights": ["uniform", "distance"]}
>>> from sklearn.preprocessing import Imputer >>> rsearch = RandomizedSearchCV(estimator=knn,
>>> scaler = Normalizer().fit(X_train) >>> imp = Imputer(missing_values=0, strategy='mean', axis=0) param_distributions=params,
>>> normalized_X = scaler.transform(X_train) >>> imp.fit_transform(X_train) cv=4,
>>> normalized_X_test = scaler.transform(X_test) n_iter=8,
random_state=5)
Binarization Generating Polynomial Features >>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
>>> from sklearn.preprocessing import Binarizer >>> from sklearn.preprocessing import PolynomialFeatures
>>> binarizer = Binarizer(threshold=0.0).fit(X) >>> poly = PolynomialFeatures(5)
>>> binary_X = binarizer.transform(X) >>> poly.fit_transform(X) DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Asking For Help Dropping
>>> help(pd.Series.loc)
>>> s.drop(['a', 'c']) Drop values from rows (axis=0)
Pandas Basics Selection Also see NumPy Arrays >>> df.drop('Country', axis=1) Drop values from columns(axis=1)
Learn Python for Data Science Interactively at www.DataCamp.com

>>> s['b'] Get one element Sort & Rank


-5
Pandas >>> df.sort_index() Sort by labels along an axis
>>> df.sort_values(by='Country') Sort by the values along an axis
>>> df[1:] Get subset of a DataFrame
The Pandas library is built on NumPy and provides easy-to-use Country Capital Population >>> df.rank() Assign ranks to entries
data structures and data analysis tools for the Python 1 India New Delhi 1303171035
2 Brazil Brasília 207847528
programming language. Retrieving Series/DataFrame Information
Basic Information
Use the following import convention: By Position >>> df.shape (rows,columns)
>>> import pandas as pd >>> df.iloc[[0],[0]] Select single value by row & >>> df.index Describe index
'Belgium' column >>> df.columns Describe DataFrame columns
Pandas Data Structures >>> df.iat[0,0]
>>>
>>>
df.info()
df.count()
Info on DataFrame
Number of non-NA values
Series 'Belgium'
Summary
A one-dimensional labeled array a 3 By Label
>>> df.loc[[0], ['Country']] Select single value by row & >>> df.sum() Sum of values
capable of holding any data type b -5 'Belgium' column labels >>> df.cumsum() Cummulative sum of values
>>> df.min()/df.max() Minimum/maximum values
Index c 7 >>> df.at[0,'Country'] >>> df.idxmin()/df.idxmax() Minimum/Maximum index value
d 4 'Belgium' >>> df.describe() Summary statistics
>>> df.mean() Mean of values
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
By Label/Position >>> df.median() Median of values
>>> df.ix[2] Select single row of
DataFrame Country
Capital
Brazil
Brasília
subset of rows Applying Functions
Population 207847528 >>> f = lambda x: x*2
Columns
Country Capital Population A two-dimensional labeled >>> df.ix[:,'Capital'] Select a single column of >>> df.apply(f) Apply function
>>> df.applymap(f) Apply function element-wise
data structure with columns 0 Brussels subset of columns
0 Belgium Brussels 11190846 1 New Delhi
of potentially different types 2 Brasília Data Alignment
1 India New Delhi 1303171035
Index >>> df.ix[1,'Capital'] Select rows and columns
2 Brazil Brasília 207847528 Internal Data Alignment
'New Delhi'
NA values are introduced in the indices that don’t overlap:
Boolean Indexing
>>> data = {'Country': ['Belgium', 'India', 'Brazil'], >>> s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
>>> s[~(s > 1)] Series s where value is not >1
'Capital': ['Brussels', 'New Delhi', 'Brasília'], >>> s[(s < -1) | (s > 2)] s where value is <-1 or >2 >>> s + s3
'Population': [11190846, 1303171035, 207847528]} >>> df[df['Population']>1200000000] Use filter to adjust DataFrame a 10.0
b NaN
>>> df = pd.DataFrame(data,
c 5.0
columns=['Country', 'Capital', 'Population']) >>> s['a'] = 6 Set index a of Series s to 6
d 7.0

I/O Arithmetic Operations with Fill Methods


You can also do the internal data alignment yourself with
Read and Write to CSV Read and Write to SQL Query or Database Table
the help of the fill methods:
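One way to do this, shown as a sketch consistent with the values printed below (assuming the Series s and s3 defined above):
>>> s.add(s3, fill_value=0)   # labels missing from one Series are filled with 0 instead of producing NaN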
>>> pd.read_csv( , header=None, nrows=5) >>> from sqlalchemy import create_engine
>>> df.to_csv('myDataFrame.csv') >>> engine = create_engine('sqlite:///:memory:') a 10.0
>>> pd.read_sql("SELECT * FROM my_table;", engine) b -5.0
Read and Write to Excel c 5.0
>>> pd.read_sql_table('my_table', engine) d 7.0
>>> pd.read_excel( ) >>> pd.read_sql_query("SELECT * FROM my_table;", engine)
>>> df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1')
read_sql()is a convenience wrapper around read_sql_table() and
Read multiple sheets from the same file
read_sql_query()
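As a short sketch (assuming the engine created above), read_sql() accepts either a table name or a SQL query and dispatches to the matching function:
>>> pd.read_sql('my_table', engine)                  # behaves like read_sql_table()
>>> pd.read_sql("SELECT * FROM my_table;", engine)   # behaves like read_sql_query()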
>>> xlsx = pd.ExcelFile( )
>>> df = pd.read_excel(xlsx, 'Sheet1') >>> df.to_sql('myDf', engine) DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Lists Also see NumPy Arrays Libraries
>>> a = 'is' Import libraries
Python Basics >>> b = 'nice' >>> import numpy Data analysis Machine learning
Learn More Python for Data Science Interactively at www.datacamp.com >>> my_list = ['my', 'list', a, b] >>> import numpy as np
>>> my_list2 = [[4,5,6,7], [3,4,5,6]] Selective import
>>> from math import pi Scientific computing 2D plotting
Variables and Data Types Selecting List Elements Index starts at 0
Subset Install Python
Variable Assignment
>>> my_list[1] Select item at index 1
>>> x=5
>>> my_list[-3] Select 3rd last item
>>> x
Slice
5 >>> my_list[1:3] Select items at index 1 and 2
Calculations With Variables >>> my_list[1:] Select items after index 0
>>> my_list[:3] Select items before index 3 Leading open data science platform Free IDE that is included Create and share
>>> x+2 Sum of two variables
>>> my_list[:] Copy my_list powered by Python with Anaconda documents with live code,
7 visualizations, text, ...
>>> x-2 Subtraction of two variables
Subset Lists of Lists
>>> my_list2[1][0] my_list[list][itemOfList]
3
>>> my_list2[1][:2] Numpy Arrays Also see Lists
>>> x*2 Multiplication of two variables
>>> my_list = [1, 2, 3, 4]
10 List Operations >>> my_array = np.array(my_list)
>>> x**2 Exponentiation of a variable
25 >>> my_list + my_list >>> my_2darray = np.array([[1,2,3],[4,5,6]])
>>> x%2 Remainder of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']
Selecting Numpy Array Elements Index starts at 0
1 >>> my_list * 2
>>> x/float(2) Division of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice'] Subset
2.5 >>> my_list2 > 4 >>> my_array[1] Select item at index 1
True 2
Types and Type Conversion Slice
List Methods >>> my_array[0:2] Select items at index 0 and 1
str() '5', '3.45', 'True' Variables to strings
my_list.index(a) Get the index of an item array([1, 2])
>>>
int() 5, 3, 1 Variables to integers >>> my_list.count(a) Count an item Subset 2D Numpy arrays
>>> my_list.append('!') Append an item at a time >>> my_2darray[:,0] my_2darray[rows, columns]
my_list.remove('!') Remove an item array([1, 4])
float() 5.0, 1.0 Variables to floats >>>
>>> del(my_list[0:1]) Remove an item Numpy Array Operations
bool() True, True, True >>> my_list.reverse() Reverse the list
Variables to booleans >>> my_array > 3
>>> my_list.extend('!') Append an item array([False, False, False, True], dtype=bool)
>>> my_list.pop(-1) Remove an item >>> my_array * 2
Asking For Help >>> my_list.insert(0,'!') Insert an item array([2, 4, 6, 8])
>>> help(str) >>> my_list.sort() Sort the list >>> my_array + np.array([5, 6, 7, 8])
array([6, 8, 10, 12])
Strings
>>> my_string = 'thisStringIsAwesome' Numpy Array Functions
String Operations Index starts at 0
>>> my_string >>> my_array.shape Get the dimensions of the array
'thisStringIsAwesome' >>> my_string[3] >>> np.append(my_array, other_array) Append items to an array
>>> my_string[4:9] >>> np.insert(my_array, 1, 5) Insert items in an array
String Operations >>> np.delete(my_array,[1]) Delete items in an array
String Methods >>> np.mean(my_array) Mean of the array
>>> my_string * 2
'thisStringIsAwesomethisStringIsAwesome' >>> my_string.upper() String to uppercase >>> np.median(my_array) Median of the array
>>> my_string + 'Innit' >>> my_string.lower() String to lowercase >>> np.corrcoef(my_array) Correlation coefficient
'thisStringIsAwesomeInnit' >>> my_string.count('w') Count String elements >>> np.std(my_array) Standard deviation
>>> 'm' in my_string >>> my_string.replace('e', 'i') Replace String elements
True >>> my_string.strip() Strip whitespaces DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Plot Anatomy & Workflow
Plot Anatomy Workflow
Matplotlib Axes/Subplot The basic steps to creating plots with matplotlib are:
Learn Python Interactively at www.DataCamp.com 1 Prepare data 2 Create plot 3 Plot 4 Customize plot 5 Save plot 6 Show plot
>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() Step 2
Matplotlib Y-axis Figure >>> ax = fig.add_subplot(111) Step 3
>>> ax.plot(x, y, color='lightblue', linewidth=3) Step 3, 4
Matplotlib is a Python 2D plotting library which produces >>> ax.scatter([2,4,6],
publication-quality figures in a variety of hardcopy formats [5,15,25],
color='darkgreen',
and interactive environments across marker='^')
platforms. >>> ax.set_xlim(1, 6.5)
X-axis
>>> plt.savefig('foo.png')

1 Prepare The Data Also see Lists & NumPy


>>> plt.show() Step 6

1D Data 4 Customize Plot


>>> import numpy as np Colors, Color Bars & Color Maps Mathtext
>>> x = np.linspace(0, 10, 100)
>>> y = np.cos(x) >>> plt.plot(x, x, x, x**2, x, x**3) >>> plt.title(r'$\sigma_i=15$', fontsize=20)
>>> z = np.sin(x) >>> ax.plot(x, y, alpha = 0.4)
>>> ax.plot(x, y, c='k') Limits, Legends & Layouts
2D Data or Images >>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img, Limits & Autoscaling
>>> data = 2 * np.random.random((10, 10)) cmap='seismic')
>>> data2 = 3 * np.random.random((10, 10)) >>> ax.margins(x=0.0,y=0.1) Add padding to a plot
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j] >>> ax.axis('equal') Set the aspect ratio of the plot to 1
Markers >>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) Set limits for x-and y-axis
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2 >>> fig, ax = plt.subplots() >>> ax.set_xlim(0,10.5) Set limits for x-axis
>>> from matplotlib.cbook import get_sample_data >>> ax.scatter(x,y,marker=".") Legends
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy')) >>> ax.plot(x,y,marker="o") >>> ax.set(title='An Example Axes', Set a title and x-and y-axis labels
ylabel='Y-Axis',
Linestyles xlabel='X-Axis')
2 Create Plot >>>
>>>
plt.plot(x,y,linewidth=4.0)
plt.plot(x,y,ls='solid')
>>> ax.legend(loc='best')
Ticks
No overlapping plot elements

>>> import matplotlib.pyplot as plt >>> ax.xaxis.set(ticks=range(1,5), Manually set x-ticks


>>> plt.plot(x,y,ls='--') ticklabels=[3,100,-12,"foo"])
Figure >>> plt.plot(x,y,'--',x**2,y**2,'-.') >>> ax.tick_params(axis='y', Make y-ticks longer and go in and out
>>> plt.setp(lines,color='r',linewidth=4.0) direction='inout',
>>> fig = plt.figure() length=10)
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0)) Text & Annotations
Subplot Spacing
Axes >>> ax.text(1, >>> fig3.subplots_adjust(wspace=0.5, Adjust the spacing between subplots
-2.1, hspace=0.3,
All plotting is done with respect to an Axes. In most cases, a 'Example Graph', left=0.125,
style='italic') right=0.9,
subplot will fit your needs. A subplot is an axes on a grid system. >>> ax.annotate("Sine", top=0.9,
>>> fig.add_axes() xy=(8, 0), bottom=0.1)
>>> ax1 = fig.add_subplot(221) # row-col-num xycoords='data', >>> fig.tight_layout() Fit subplot(s) in to the figure area
xytext=(10.5, 0),
>>> ax3 = fig.add_subplot(212) textcoords='data', Axis Spines
>>> fig3, axes = plt.subplots(nrows=2,ncols=2) arrowprops=dict(arrowstyle="->", >>> ax1.spines['top'].set_visible(False) Make the top axis line for a plot invisible
>>> fig4, axes2 = plt.subplots(ncols=3) connectionstyle="arc3"),) >>> ax1.spines['bottom'].set_position(('outward',10)) Move the bottom axis line outward

3 Plotting Routines 5 Save Plot


1D Data Vector Fields Save figures
>>> plt.savefig('foo.png')
>>> fig, ax = plt.subplots() >>> axes[0,1].arrow(0,0,0.5,0.5) Add an arrow to the axes
>>> lines = ax.plot(x,y) Draw points with lines or markers connecting them >>> axes[1,1].quiver(y,z) Plot a 2D field of arrows Save transparent figures
>>> ax.scatter(x,y) Draw unconnected points, scaled or colored >>> axes[0,1].streamplot(X,Y,U,V) Plot a 2D field of arrows >>> plt.savefig('foo.png', transparent=True)
>>> axes[0,0].bar([1,2,3],[3,4,5]) Plot vertical rectangles (constant width)
>>>
>>>
>>>
axes[1,0].barh([0.5,1,2.5],[0,1,2])
axes[1,1].axhline(0.45)
axes[0,1].axvline(0.65)
Plot horizontal rectangles (constant height)
Draw a horizontal line across axes
Draw a vertical line across axes
Data Distributions
>>> ax1.hist(y) Plot a histogram
6 Show Plot
>>> plt.show()
>>> ax.fill(x,y,color='blue') Draw filled polygons >>> ax3.boxplot(y) Make a box and whisker plot
>>> ax.fill_between(x,y,color='yellow') Fill between y-values and 0 >>> ax3.violinplot(z) Make a violin plot
2D Data or Images Close & Clear
>>> fig, ax = plt.subplots() >>> plt.cla() Clear an axis
>>> axes2[0].pcolor(data2) Pseudocolor plot of 2D array >>> plt.clf() Clear the entire figure
>>> im = ax.imshow(img, Colormapped or RGB arrays >>> axes2[0].pcolormesh(data) Pseudocolor plot of 2D array
cmap='gist_earth', >>> plt.close() Close a window
interpolation='nearest', >>> CS = plt.contour(Y,X,U) Plot contours
vmin=-2, >>> axes2[2].contourf(data1) Plot filled contours
vmax=2) >>> axes2[2]= ax.clabel(CS) Label a contour plot DataCamp
Learn Python for Data Science Interactively
Matplotlib 2.0.0 - Updated on: 02/2017
Python For Data Science Cheat Sheet Excel Spreadsheets Pickled Files
>>> file = 'urbanpop.xlsx' >>> import pickle
Importing Data >>> data = pd.ExcelFile(file) >>> with open('pickled_fruit.pkl', 'rb') as file:
pickled_data = pickle.load(file)
>>> df_sheet2 = data.parse('1960-1966',
Learn Python for data science Interactively at www.DataCamp.com skiprows=[0],
names=['Country',
'AAM: War(2002)'])
>>> df_sheet1 = data.parse(0, HDF5 Files
parse_cols=[0],
Importing Data in Python skiprows=[0], >>> import h5py
>>> filename = 'H-H1_LOSC_4_v1-815411200-4096.hdf5'
names=['Country'])
Most of the time, you’ll use either NumPy or pandas to import >>> data = h5py.File(filename, 'r')
your data: To access the sheet names, use the sheet_names attribute:
>>> import numpy as np >>> data.sheet_names
>>> import pandas as pd Matlab Files
Help SAS Files >>> import scipy.io
>>> filename = 'workspace.mat'
>>> from sas7bdat import SAS7BDAT >>> mat = scipy.io.loadmat(filename)
>>> np.info(np.ndarray.dtype)
>>> help(pd.read_csv) >>> with SAS7BDAT('urbanpop.sas7bdat') as file:
df_sas = file.to_data_frame()

Text Files Exploring Dictionaries


Stata Files Accessing Elements with Functions
Plain Text Files >>> data = pd.read_stata('urbanpop.dta') >>> print(mat.keys()) Print dictionary keys
>>> filename = 'huck_finn.txt' >>> for key in data.keys(): Print dictionary keys
>>> file = open(filename, mode='r') Open the file for reading print(key)
>>> text = file.read() Read a file’s contents Relational Databases meta
quality
>>> print(file.closed) Check whether file is closed
>>> from sqlalchemy import create_engine strain
>>> file.close() Close file
>>> print(text) >>> engine = create_engine('sqlite:///Northwind.sqlite') >>> pickled_data.values() Return dictionary values
>>> print(mat.items()) Returns items in list format of (key, value)
Use the table_names() method to fetch a list of table names: tuple pairs
Using the context manager with
>>> with open('huck_finn.txt', 'r') as file:
>>> table_names = engine.table_names() Accessing Data Items with Keys
print(file.readline()) Read a single line
print(file.readline()) Querying Relational Databases >>> for key in data['meta'].keys(): Explore the HDF5 structure
print(file.readline()) print(key)
>>> con = engine.connect() Description
>>> rs = con.execute("SELECT * FROM Orders") DescriptionURL
Table Data: Flat Files >>> df = pd.DataFrame(rs.fetchall()) Detector
>>> df.columns = rs.keys() Duration
GPSstart
Importing Flat Files with numpy >>> con.close()
Observatory
Files with one data type Using the context manager with Type
UTCstart
>>> filename = ‘mnist.txt’ >>> with engine.connect() as con:
>>> print(data['meta']['Description'].value) Retrieve the value for a key
>>> data = np.loadtxt(filename, rs = con.execute("SELECT OrderID FROM Orders")
delimiter=',', String used to separate values df = pd.DataFrame(rs.fetchmany(size=5))
df.columns = rs.keys()
skiprows=2,
usecols=[0,2],
Skip the first 2 lines
Read the 1st and 3rd column
Navigating Your FileSystem
dtype=str) The type of the resulting array Querying relational databases with pandas
Magic Commands
Files with mixed data types >>> df = pd.read_sql_query("SELECT * FROM Orders", engine)
>>> filename = 'titanic.csv' !ls List directory contents of files and directories
>>> data = np.genfromtxt(filename, %cd .. Change current working directory
%pwd Return the current working directory path
delimiter=',',
names=True, Look for column header
Exploring Your Data
dtype=None)
NumPy Arrays os Library
>>> data_array = np.recfromcsv(filename) >>> data_array.dtype Data type of array elements >>> import os
>>> data_array.shape Array dimensions >>> path = "/usr/tmp"
The default dtype of the np.recfromcsv() function is None. >>> wd = os.getcwd() Store the name of current directory in a string
>>> len(data_array) Length of array
>>> os.listdir(wd) Output contents of the directory in a list
Importing Flat Files with pandas >>> os.chdir(path) Change current working directory
pandas DataFrames >>> os.rename("test1.txt", Rename a file
>>> filename = 'winequality-red.csv' "test2.txt")
>>> data = pd.read_csv(filename, >>> df.head() Return first DataFrame rows
nrows=5, >>> os.remove("test1.txt") Delete an existing file
Number of rows of file to read >>> df.tail() Return last DataFrame rows >>> os.mkdir("newdir") Create a new directory
header=None, Row number to use as col names >>> df.index Describe index
sep='\t', Delimiter to use >>> df.columns Describe DataFrame columns
comment='#', Character to split comments >>> df.info() Info on DataFrame
na_values=[""]) String to recognize as NA/NaN >>> data_array = data.values Convert a DataFrame to an a NumPy array DataCamp
Learn R for Data Science Interactively
Working with Different Programming Languages Widgets
Python For Data Science Cheat Sheet Kernels provide computation and communication with front-end interfaces Notebook widgets provide the ability to visualize and control changes
Jupyter Notebook like the notebooks. There are three main kernels: in your data, often as a control like a slider, textbox, etc.
Learn More Python for Data Science Interactively at www.DataCamp.com
You can use them to build interactive GUIs for your notebooks or to
IRkernel IJulia
synchronize stateful and stateless information between Python and
Installing Jupyter Notebook will automatically install the IPython kernel. JavaScript.
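A tiny sketch of such a widget (using the ipywidgets package, which is assumed to be installed; not part of the original sheet):
>>> from ipywidgets import interact
>>> def show(x): print(x)
>>> interact(show, x=(0, 10))   # renders an integer slider in the notebook and re-runs show() on each change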
Saving/Loading Notebooks Restart kernel Interrupt kernel
Create new notebook Restart kernel & run Interrupt kernel & Download serialized Save notebook
all cells clear all output state of all widget with interactive
Open an existing
Connect back to a models in use widgets
Make a copy of the notebook Restart kernel & run remote notebook
current notebook all cells Embed current
Rename notebook Run other installed
widgets
kernels
Revert notebook to a
Save current notebook
previous checkpoint Command Mode:
and record checkpoint
Download notebook as
Preview of the printed - IPython notebook 15
notebook - Python
- HTML
Close notebook & stop - Markdown 13 14
- reST
running any scripts - LaTeX 1 2 3 4 5 6 7 8 9 10 11 12
- PDF

Writing Code And Text


Code and text are encapsulated by 3 basic cell types: markdown cells, code
cells, and raw NBConvert cells.
Edit Cells Edit Mode: 1. Save and checkpoint 9. Interrupt kernel
2. Insert cell below 10. Restart kernel
3. Cut cell 11. Display characteristics
Cut currently selected cells Copy cells from 4. Copy cell(s) 12. Open command palette
to clipboard clipboard to current 5. Paste cell(s) below 13. Current kernel
cursor position 6. Move cell up 14. Kernel status
Paste cells from Executing Cells 7. Move cell down 15. Log out from notebook server
clipboard above Paste cells from 8. Run current cell
current cell Run selected cell(s) Run current cells down
clipboard below
and create a new one
Paste cells from current cell
below Asking For Help
clipboard on top Run current cells down
Delete current cells
of current cell and create a new one Walk through a UI tour
Split up a cell from above Run all cells
Revert “Delete Cells” List of built-in keyboard
current cursor Run all cells above the Run all cells below
invocation shortcuts
position current cell the current cell Edit the built-in
Merge current cell Merge current cell keyboard shortcuts
Change the cell type of toggle, toggle Notebook help topics
with the one above with the one below current cell scrolling and clear Description of
Move current cell up Move current cell toggle, toggle current outputs markdown available Information on
down scrolling and clear in notebook unofficial Jupyter
Adjust metadata
underlying the Find and replace all output Notebook extensions
Python help topics
current notebook in selected cells IPython help topics
View Cells
Remove cell Copy attachments of NumPy help topics
attachments current cell Toggle display of Jupyter SciPy help topics
Toggle display of toolbar Matplotlib help topics
Paste attachments of Insert image in logo and filename
SymPy help topics
current cell selected cells Toggle display of cell Pandas help topics
action icons:
Insert Cells - None About Jupyter Notebook
- Edit metadata
Toggle line numbers - Raw cell format
Add new cell above the Add new cell below the - Slideshow
current one in cells - Attachments
current one DataCamp
- Tags
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Model Architecture Inspect Model
>>> model.output_shape Model output shape
Sequential Model
Keras >>> from keras.models import Sequential
Learn Python for data science Interactively at www.DataCamp.com
>>> model = Sequential()
>>> model2 = Sequential()
>>> model3 = Sequential()
>>> model.summary() Model summary representation
>>> model.get_config() Model configuration
>>> model.get_weights() List all weight tensors in the model
Multilayer Perceptron (MLP) MLP: Binary Classification
Keras Binary Classification >>> model.compile(optimizer='adam',
loss='binary_crossentropy',
Keras is a powerful and easy-to-use deep learning library for >>> from keras.layers import Dense metrics=['accuracy'])
Theano and TensorFlow that provides a high-level neural >>> model.add(Dense(12, MLP: Multi-Class Classification
input_dim=8, >>> model.compile(optimizer='rmsprop',
networks API to develop and evaluate deep learning models. kernel_initializer='uniform', loss='categorical_crossentropy',
activation='relu')) metrics=['accuracy'])
A Basic Example >>> model.add(Dense(8,kernel_initializer='uniform',activation='relu'))
MLP: Regression
>>> model.add(Dense(1,kernel_initializer='uniform',activation='sigmoid')) >>> model.compile(optimizer='rmsprop',
>>> import numpy as np loss='mse',
>>> from keras.models import Sequential Multi-Class Classification metrics=['mae'])
>>> from keras.layers import Dense >>> from keras.layers import Dropout
>>> data = np.random.random((1000,100)) >>> model.add(Dense(512,activation='relu',input_shape=(784,))) Recurrent Neural Network
>>> labels = np.random.randint(2,size=(1000,1)) >>> model.add(Dropout(0.2)) >>> model3.compile(loss='binary_crossentropy',
>>> model = Sequential() optimizer='adam',
>>> model.add(Dense(512,activation='relu')) metrics=['accuracy'])
>>> model.add(Dense(32,
              activation='relu',
              input_dim=100))
>>> model.add(Dense(1, activation='sigmoid'))
>>> model.add(Dropout(0.2))
>>> model.add(Dense(10,activation='softmax'))
Regression Model Training
>>> model.compile(optimizer='rmsprop', >>> model.add(Dense(64,activation='relu',input_dim=train_data.shape[1])) >>> model3.fit(x_train4,
loss='binary_crossentropy', >>> model.add(Dense(1)) y_train4,
metrics=['accuracy']) batch_size=32,
>>> model.fit(data,labels,epochs=10,batch_size=32) Convolutional Neural Network (CNN) epochs=15,
verbose=1,
>>> predictions = model.predict(data) >>> from keras.layers import Activation,Conv2D,MaxPooling2D,Flatten validation_data=(x_test4,y_test4))
>>> model2.add(Conv2D(32,(3,3),padding='same',input_shape=x_train.shape[1:]))
Data Also see NumPy, Pandas & Scikit-Learn
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(32,(3,3))) Evaluate Your Model's Performance
Your data needs to be stored as NumPy arrays or as a list of NumPy arrays. Ideally, >>> model2.add(Activation('relu')) >>> score = model3.evaluate(x_test,
>>> model2.add(MaxPooling2D(pool_size=(2,2))) y_test,
you split the data in training and test sets, for which you can also resort batch_size=32)
>>> model2.add(Dropout(0.25))
to the train_test_split function of sklearn.model_selection.
>>> model2.add(Conv2D(64,(3,3), padding='same'))
Keras Data Sets
>>> model2.add(Activation('relu'))
>>> model2.add(Conv2D(64,(3, 3))) Prediction
Prediction
>>> from keras.datasets import boston_housing, >>> model2.add(Activation('relu')) >>> model3.predict(x_test4, batch_size=32)
mnist, >>> model2.add(MaxPooling2D(pool_size=(2,2))) >>> model3.predict_classes(x_test4,batch_size=32)
cifar10, >>> model2.add(Dropout(0.25))
imdb
>>> (x_train,y_train),(x_test,y_test) = mnist.load_data()
>>> (x_train2,y_train2),(x_test2,y_test2) = boston_housing.load_data()
>>> model2.add(Flatten())
>>> model2.add(Dense(512))
Save/ Reload Models
>>> (x_train3,y_train3),(x_test3,y_test3) = cifar10.load_data() >>> model2.add(Activation('relu')) >>> from keras.models import load_model
>>> (x_train4,y_train4),(x_test4,y_test4) = imdb.load_data(num_words=20000) >>> model2.add(Dropout(0.5)) >>> model3.save('model_file.h5')
>>> num_classes = 10 >>> my_model = load_model('model_file.h5')
>>> model2.add(Dense(num_classes))
>>> model2.add(Activation('softmax'))
Other
Recurrent Neural Network (RNN) Model Fine-tuning
>>> from urllib.request import urlopen
>>> data = np.loadtxt(urlopen("http://archive.ics.uci.edu/
ml/machine-learning-databases/pima-indians-diabetes/
>>> from keras.layers import Embedding,LSTM Optimization Parameters
pima-indians-diabetes.data"),delimiter=",") >>> model3.add(Embedding(20000,128)) >>> from keras.optimizers import RMSprop
>>> X = data[:,0:8] >>> model3.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2)) >>> opt = RMSprop(lr=0.0001, decay=1e-6)
>>> y = data [:,8] >>> model3.add(Dense(1,activation='sigmoid')) >>> model2.compile(loss='categorical_crossentropy',
optimizer=opt,
metrics=['accuracy'])
Preprocessing Also see NumPy & Scikit-Learn
Early Stopping
Sequence Padding Train and Test Sets >>> from keras.callbacks import EarlyStopping
>>> from keras.preprocessing import sequence >>> from sklearn.model_selection import train_test_split >>> early_stopping_monitor = EarlyStopping(patience=2)
>>> x_train4 = sequence.pad_sequences(x_train4,maxlen=80) >>> X_train5,X_test5,y_train5,y_test5 = train_test_split(X, >>> model3.fit(x_train4,
>>> x_test4 = sequence.pad_sequences(x_test4,maxlen=80) y,
test_size=0.33, y_train4,
random_state=42) batch_size=32,
One-Hot Encoding epochs=15,
>>> from keras.utils import to_categorical Standardization/Normalization validation_data=(x_test4,y_test4),
>>> Y_train = to_categorical(y_train, num_classes) >>> from sklearn.preprocessing import StandardScaler callbacks=[early_stopping_monitor])
>>> Y_test = to_categorical(y_test, num_classes) >>> scaler = StandardScaler().fit(x_train2)
>>> Y_train3 = to_categorical(y_train3, num_classes) >>> standardized_X = scaler.transform(x_train2) DataCamp
>>> Y_test3 = to_categorical(y_test3, num_classes) >>> standardized_X_test = scaler.transform(x_test2) Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Plot Anatomy & Workflow
Plot Anatomy Workflow
Matplotlib Axes/Subplot The basic steps to creating plots with matplotlib are:
Learn Python Interactively at www.DataCamp.com 1 Prepare data 2 Create plot 3 Plot 4 Customize plot 5 Save plot 6 Show plot
>>> import matplotlib.pyplot as plt
>>> x = [1,2,3,4] Step 1
>>> y = [10,20,25,30]
>>> fig = plt.figure() Step 2
Matplotlib Y-axis Figure >>> ax = fig.add_subplot(111) Step 3
>>> ax.plot(x, y, color='lightblue', linewidth=3) Step 3, 4
Matplotlib is a Python 2D plotting library which produces >>> ax.scatter([2,4,6],
publication-quality figures in a variety of hardcopy formats [5,15,25],
color='darkgreen',
and interactive environments across marker='^')
platforms. >>> ax.set_xlim(1, 6.5)
X-axis
>>> plt.savefig('foo.png')
1 Prepare The Data Also see Lists & NumPy
>>> plt.show() Step 6
1D Data 4 Customize Plot
>>> import numpy as np Colors, Color Bars & Color Maps Mathtext
>>> x = np.linspace(0, 10, 100)
>>> y = np.cos(x) >>> plt.plot(x, x, x, x**2, x, x**3) >>> plt.title(r'$sigma_i=15$', fontsize=20)
>>> z = np.sin(x) >>> ax.plot(x, y, alpha = 0.4)
>>> ax.plot(x, y, c='k') Limits, Legends & Layouts
2D Data or Images >>> fig.colorbar(im, orientation='horizontal')
>>> im = ax.imshow(img, Limits & Autoscaling
>>> data = 2 * np.random.random((10, 10)) cmap='seismic')
>>> data2 = 3 * np.random.random((10, 10)) >>> ax.margins(x=0.0,y=0.1) Add padding to a plot
>>> Y, X = np.mgrid[-3:3:100j, -3:3:100j] >>> ax.axis('equal') Set the aspect ratio of the plot to 1
Markers >>> ax.set(xlim=[0,10.5],ylim=[-1.5,1.5]) Set limits for x-and y-axis
>>> U = -1 - X**2 + Y
>>> V = 1 + X - Y**2 >>> fig, ax = plt.subplots() >>> ax.set_xlim(0,10.5) Set limits for x-axis
>>> from matplotlib.cbook import get_sample_data >>> ax.scatter(x,y,marker=".") Legends
>>> img = np.load(get_sample_data('axes_grid/bivariate_normal.npy')) >>> ax.plot(x,y,marker="o") >>> ax.set(title='An Example Axes', Set a title and x-and y-axis labels
ylabel='Y-Axis',
Linestyles xlabel='X-Axis')
2 Create Plot
>>> plt.plot(x,y,linewidth=4.0)
>>> plt.plot(x,y,ls='solid')
>>> ax.legend(loc='best')
Ticks
No overlapping plot elements
>>> import matplotlib.pyplot as plt >>> ax.xaxis.set(ticks=range(1,5), Manually set x-ticks
>>> plt.plot(x,y,ls='--') ticklabels=[3,100,-12,"foo"])
Figure >>> plt.plot(x,y,'--',x**2,y**2,'-.') >>> ax.tick_params(axis='y', Make y-ticks longer and go in and out
>>> plt.setp(lines,color='r',linewidth=4.0) direction='inout',
>>> fig = plt.figure() length=10)
>>> fig2 = plt.figure(figsize=plt.figaspect(2.0)) Text & Annotations
Subplot Spacing
Axes >>> ax.text(1, >>> fig3.subplots_adjust(wspace=0.5, Adjust the spacing between subplots
-2.1, hspace=0.3,
All plotting is done with respect to an Axes. In most cases, a 'Example Graph', left=0.125,
style='italic') right=0.9,
subplot will fit your needs. A subplot is an axes on a grid system. >>> ax.annotate("Sine", top=0.9,
>>> fig.add_axes() xy=(8, 0), bottom=0.1)
>>> ax1 = fig.add_subplot(221) # row-col-num xycoords='data', >>> fig.tight_layout() Fit subplot(s) in to the figure area
xytext=(10.5, 0),
>>> ax3 = fig.add_subplot(212) textcoords='data', Axis Spines
>>> fig3, axes = plt.subplots(nrows=2,ncols=2) arrowprops=dict(arrowstyle="->", >>> ax1.spines['top'].set_visible(False) Make the top axis line for a plot invisible
>>> fig4, axes2 = plt.subplots(ncols=3) connectionstyle="arc3"),) >>> ax1.spines['bottom'].set_position(('outward',10)) Move the bottom axis line outward
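A minimal sketch (with assumed data) of drawing on individual axes of the subplot grid created above:
>>> fig3, axes = plt.subplots(nrows=2, ncols=2)
>>> axes[0,0].plot([1,2,3], [3,4,5])        Draw on the top-left subplot
>>> axes[1,1].scatter([1,2,3], [5,15,25])   Draw on the bottom-right subplot
>>> fig3.tight_layout()
>>> plt.show()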
3 Plotting Routines 5 Save Plot
1D Data Vector Fields Save figures
>>> plt.savefig('foo.png')
>>> fig, ax = plt.subplots() >>> axes[0,1].arrow(0,0,0.5,0.5) Add an arrow to the axes
>>> lines = ax.plot(x,y) Draw points with lines or markers connecting them >>> axes[1,1].quiver(y,z) Plot a 2D field of arrows Save transparent figures
>>> ax.scatter(x,y) Draw unconnected points, scaled or colored >>> axes[0,1].streamplot(X,Y,U,V) Plot a 2D field of arrows >>> plt.savefig('foo.png', transparent=True)
>>> axes[0,0].bar([1,2,3],[3,4,5]) Plot vertical rectangles (constant width)
>>> axes[1,0].barh([0.5,1,2.5],[0,1,2]) Plot horizontal rectangles (constant height)
>>> axes[1,1].axhline(0.45) Draw a horizontal line across axes
>>> axes[0,1].axvline(0.65) Draw a vertical line across axes
Data Distributions
>>> ax1.hist(y) Plot a histogram
6 Show Plot
>>> plt.show()
>>> ax.fill(x,y,color='blue') Draw filled polygons >>> ax3.boxplot(y) Make a box and whisker plot
>>> ax.fill_between(x,y,color='yellow') Fill between y-values and 0 >>> ax3.violinplot(z) Make a violin plot
2D Data or Images Close & Clear
>>> fig, ax = plt.subplots() >>> plt.cla() Clear an axis
>>> axes2[0].pcolor(data2) Pseudocolor plot of 2D array >>> plt.clf() Clear the entire figure
>>> im = ax.imshow(img, Colormapped or RGB arrays >>> axes2[0].pcolormesh(data) Pseudocolor plot of 2D array
cmap='gist_earth', >>> plt.close() Close a window
interpolation='nearest', >>> CS = plt.contour(Y,X,U) Plot contours
vmin=-2, >>> axes2[2].contourf(data2) Plot filled contours
vmax=2) >>> axes2[2]= ax.clabel(CS) Label a contour plot DataCamp
Learn Python for Data Science Interactively
Matplotlib 2.0.0 - Updated on: 02/2017
Python For Data Science Cheat Sheet Inspecting Your Array Subsetting, Slicing, Indexing Also see Lists
>>> a.shape Array dimensions Subsetting
NumPy Basics
>>> len(a) Length of array
>>> b.ndim Number of array dimensions
>>> a[2]
3
1 2 3 Select the element at the 2nd index
Learn Python for Data Science Interactively at www.DataCamp.com >>> e.size Number of array elements >>> b[1,2] 1.5 2 3 Select the element at row 1 column 2
>>> b.dtype Data type of array elements 6.0 4 5 6 (equivalent to b[1][2])
>>> b.dtype.name Name of data type
>>> b.astype(int) Convert an array to a different type Slicing
NumPy >>> a[0:2]
array([1, 2])
1 2 3 Select items at index 0 and 1
2
The NumPy library is the core library for scientific computing in Asking For Help >>> b[0:2,1] 1.5 2 3 Select items at rows 0 and 1 in column 1
>>> np.info(np.ndarray.dtype) array([ 2., 5.]) 4 5 6
Python. It provides a high-performance multidimensional array
Array Mathematics
1.5 2 3
>>> b[:1] Select all items at row 0
object, and tools for working with these arrays. array([[1.5, 2., 3.]]) 4 5 6 (equivalent to b[0:1, :])
Arithmetic Operations >>> c[1,...] Same as [1,:,:]
Use the following import convention: array([[[ 3., 2., 1.],
>>> import numpy as np [ 4., 5., 6.]]])
>>> g = a - b Subtraction
array([[-0.5, 0. , 0. ], >>> a[ : :-1] Reversed array a
NumPy Arrays [-3. , -3. , -3. ]])
array([3, 2, 1])
>>> np.subtract(a,b) Boolean Indexing
1D array 2D array 3D array Subtraction
>>> a[a<2] Select elements from a less than 2
>>> b + a Addition 1 2 3
array([[ 2.5, 4. , 6. ], array([1])
axis 1 axis 2
1 2 3 axis 1 [ 5. , 7. , 9. ]]) Fancy Indexing
1.5 2 3 >>> np.add(b,a) Addition >>> b[[1, 0, 1, 0],[0, 1, 2, 0]] Select elements (1,0),(0,1),(1,2) and (0,0)
axis 0 axis 0 array([ 4. , 2. , 6. , 1.5])
4 5 6 >>> a / b Division
array([[ 0.66666667, 1. , 1. ], >>> b[[1, 0, 1, 0]][:,[0,1,2,0]] Select a subset of the matrix’s rows
[ 0.25 , 0.4 , 0.5 ]]) array([[ 4. ,5. , 6. , 4. ], and columns
>>> np.divide(a,b) Division [ 1.5, 2. , 3. , 1.5],
       [ 4. , 5. , 6. , 4. ],
       [ 1.5, 2. , 3. , 1.5]])
Creating Arrays
>>> a * b Multiplication
array([[ 1.5, 4. , 9. ],
       [ 4. , 10. , 18. ]])
>>> a = np.array([1,2,3])
>>> b = np.array([(1.5,2,3), (4,5,6)], dtype = float) >>> np.multiply(a,b) Multiplication Array Manipulation
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]], >>> np.exp(b) Exponentiation
dtype = float) >>> np.sqrt(b) Square root Transposing Array
>>> np.sin(a) Print sines of an array >>> i = np.transpose(b) Permute array dimensions
Initial Placeholders >>> np.cos(b) Element-wise cosine >>> i.T Permute array dimensions
>>> np.log(a) Element-wise natural logarithm
>>> np.zeros((3,4)) Create an array of zeros >>> e.dot(f) Dot product
Changing Array Shape
>>> np.ones((2,3,4),dtype=np.int16) Create an array of ones array([[ 7., 7.], >>> b.ravel() Flatten the array
>>> d = np.arange(10,25,5) Create an array of evenly [ 7., 7.]]) >>> g.reshape(3,-2) Reshape, but don’t change data
spaced values (step value)
>>> np.linspace(0,2,9) Create an array of evenly Comparison Adding/Removing Elements
spaced values (number of samples) >>> h.resize((2,6)) Return a new array with shape (2,6)
>>> e = np.full((2,2),7) Create a constant array >>> a == b Element-wise comparison >>> np.append(h,g) Append items to an array
>>> f = np.eye(2) Create a 2X2 identity matrix array([[False, True, True], >>> np.insert(a, 1, 5) Insert items in an array
>>> np.random.random((2,2)) Create an array with random values [False, False, False]], dtype=bool) >>> np.delete(a,[1]) Delete items from an array
>>> np.empty((3,2)) Create an empty array >>> a < 2 Element-wise comparison
array([True, False, False], dtype=bool) Combining Arrays
>>> np.array_equal(a, b) Array-wise comparison >>> np.concatenate((a,d),axis=0) Concatenate arrays
I/O
array([ 1, 2, 3, 10, 15, 20])
>>> np.vstack((a,b)) Stack arrays vertically (row-wise)
Aggregate Functions array([[ 1. , 2. , 3. ],
Saving & Loading On Disk [ 1.5, 2. , 3. ],
>>> a.sum() Array-wise sum [ 4. , 5. , 6. ]])
>>> np.save('my_array', a) >>> a.min() Array-wise minimum value >>> np.r_[e,f] Stack arrays vertically (row-wise)
>>> np.savez('array.npz', a, b) >>> b.max(axis=0) Maximum value of an array row >>> np.hstack((e,f)) Stack arrays horizontally (column-wise)
>>> np.load('my_array.npy') >>> b.cumsum(axis=1) Cumulative sum of the elements array([[ 7., 7., 1., 0.],
>>> a.mean() Mean [ 7., 7., 0., 1.]])
Saving & Loading Text Files >>> np.median(b) Median
>>> np.loadtxt("myfile.txt") >>> np.corrcoef(a) Correlation coefficient
>>> np.std(b) Standard deviation [ 2, 15],
>>> np.genfromtxt("my_file.csv", delimiter=',') [ 3, 20]])
>>> np.savetxt("myarray.txt", a, delimiter=" ") >>> np.c_[a,d] Create stacked column-wise arrays
Copying Arrays Splitting Arrays
Data Types >>> h = a.view() Create a view of the array with the same data >>> np.hsplit(a,3) Split the array horizontally at the 3rd
>>> np.copy(a) Create a copy of the array [array([1]),array([2]),array([3])] index
>>> np.int64 Signed 64-bit integer types >>> np.vsplit(c,2) Split the array vertically at the 2nd index
>>> np.float32 Standard single-precision floating point >>> h = a.copy() Create a deep copy of the array [array([[[ 1.5, 2. , 1. ],
>>> np.complex Complex numbers represented by two 64-bit floats [ 4. , 5. , 6. ]]]),
array([[[ 3., 2., 3.],
>>> np.bool Boolean type storing TRUE and FALSE values
>>> np.object Python object type
Sorting Arrays
>>> np.string_ Fixed-length string type >>> a.sort() Sort an array
>>> np.unicode_ Fixed-length unicode type >>> c.sort(axis=0) Sort the elements of an array's axis DataCamp
Learn Python for Data Science Interactively
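A minimal sketch contrasting the view and copy calls from the Copying Arrays panel above (input values are illustrative):
>>> a = np.array([1, 2, 3])
>>> h = a.view()   A view shares memory with a
>>> h[0] = 99      a is now array([99, 2, 3])
>>> c = a.copy()   A copy is independent
>>> c[1] = -1      a is unchanged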
Python For Data Science Cheat Sheet Advanced Indexing Also see NumPy Arrays Combining Data
Selecting data1 data2
Pandas >>> df3.loc[:,(df3>1).any()] Select cols with any vals >1 X1 X2 X1 X3
Learn Python for Data Science Interactively at www.DataCamp.com >>> df3.loc[:,(df3>1).all()] Select cols with vals > 1
>>> df3.loc[:,df3.isnull().any()] Select cols with NaN a 11.432 a 20.784
>>> df3.loc[:,df3.notnull().all()] Select cols without NaN b 1.303 b NaN
Indexing With isin c 99.906 d 20.784
>>> df[(df.Country.isin(df2.Type))] Find same elements
Reshaping Data >>> df3.filter(items=”a”,”b”]) Filter on values
Merge
>>> df.select(lambda x: not x%5) Select specific elements
Pivot Where X1 X2 X3
>>> pd.merge(data1,
>>> df3= df2.pivot(index='Date', Spread rows into columns >>> s.where(s > 0) Subset the data data2, a 11.432 20.784
columns='Type', Query how='left',
values='Value') b 1.303 NaN
>>> df6.query('second > first') Query DataFrame on='X1')
c 99.906 NaN
Date Type Value
0 2016-03-01 a 11.432 Type a b c Setting/Resetting Index >>> pd.merge(data1, X1 X2 X3
1 2016-03-02 b 13.031 Date data2, a 11.432 20.784
>>> df.set_index('Country') Set the index
how='right',
2 2016-03-01 c 20.784 2016-03-01 11.432 NaN 20.784 >>> df4 = df.reset_index() Reset the index b 1.303 NaN
on='X1')
3 2016-03-03 a 99.906 >>> df = df.rename(index=str, Rename DataFrame d NaN 20.784
2016-03-02 1.303 13.031 NaN columns={"Country":"cntry",
4 2016-03-02 a 1.303 "Capital":"cptl", >>> pd.merge(data1,
2016-03-03 99.906 NaN 20.784 "Population":"ppltn"}) X1 X2 X3
5 2016-03-03 c 20.784 data2,
how='inner', a 11.432 20.784
Pivot Table Reindexing on='X1') b 1.303 NaN
>>> s2 = s.reindex(['a','c','d','e','b'])
>>> df4 = pd.pivot_table(df2, Spread rows into columns X1 X2 X3
values='Value', Forward Filling Backward Filling >>> pd.merge(data1,
index='Date', data2, a 11.432 20.784
columns='Type') >>> df.reindex(range(4), >>> s3 = s.reindex(range(5), how='outer', b 1.303 NaN
method='ffill') method='bfill') on='X1') c 99.906 NaN
Stack / Unstack Country Capital Population 0 3
0 Belgium Brussels 11190846 1 3 d NaN 20.784
>>> stacked = df5.stack() Pivot a level of column labels 1 India New Delhi 1303171035 2 3
>>> stacked.unstack() Pivot a level of index labels 2 Brazil Brasília 207847528 3 3 Join
3 Brazil Brasília 207847528 4 3
0 1 1 5 0 0.233482 >>> data1.join(data2, how='right')
1 5 0.233482 0.390959 1 0.390959 MultiIndexing Concatenate
2 4 0.184713 0.237102 2 4 0 0.184713
>>> arrays = [np.array([1,2,3]),
3 3 0.433522 0.429401 1 0.237102 np.array([5,4,3])] Vertical
>>> df5 = pd.DataFrame(np.random.rand(3, 2), index=arrays) >>> s.append(s2)
Unstacked 3 3 0 0.433522
>>> tuples = list(zip(*arrays)) Horizontal/Vertical
1 0.429401 >>> index = pd.MultiIndex.from_tuples(tuples, >>> pd.concat([s,s2],axis=1, keys=['One','Two'])
Stacked names=['first', 'second']) >>> pd.concat([data1, data2], axis=1, join='inner')
>>> df6 = pd.DataFrame(np.random.rand(3, 2), index=index)
Melt >>> df2.set_index(["Date", "Type"])
>>> pd.melt(df2, Gather columns into rows
Dates
id_vars=["Date"],
value_vars=["Type", "Value"],
Duplicate Data >>> df2['Date']= pd.to_datetime(df2['Date'])
>>> df2['Date']= pd.date_range('2000-1-1',
value_name="Observations") >>> s3.unique() Return unique values periods=6,
>>> df2.duplicated('Type') Check duplicates freq='M')
Date Type Value
Date Variable Observations >>> dates = [datetime(2012,5,1), datetime(2012,5,2)]
0 2016-03-01 Type a >>> df2.drop_duplicates('Type', keep='last') Drop duplicates >>> index = pd.DatetimeIndex(dates)
0 2016-03-01 a 11.432 1 2016-03-02 Type b
>>> df.index.duplicated() Check index duplicates >>> index = pd.date_range(datetime(2012,2,1), end, freq='BM')
1 2016-03-02 b 13.031 2 2016-03-01 Type c
2 2016-03-01 c 20.784 3 2016-03-03 Type a Grouping Data Visualization Also see Matplotlib
4 2016-03-02 Type a
3 2016-03-03 a 99.906
5 2016-03-03 Type c Aggregation >>> import matplotlib.pyplot as plt
4 2016-03-02 a 1.303 >>> df2.groupby(by=['Date','Type']).mean()
6 2016-03-01 Value 11.432 >>> s.plot() >>> df2.plot()
>>> df4.groupby(level=0).sum()
5 2016-03-03 c 20.784 7 2016-03-02 Value 13.031 >>> df4.groupby(level=0).agg({'a':lambda x:sum(x)/len(x), >>> plt.show() >>> plt.show()
8 2016-03-01 Value 20.784 'b': np.sum})
9 2016-03-03 Value 99.906 Transformation
>>> customSum = lambda x: (x+x%2)
10 2016-03-02 Value 1.303
>>> df4.groupby(level=0).transform(customSum)
11 2016-03-03 Value 20.784
Iteration Missing Data
>>> df.dropna() Drop NaN values
>>> df.iteritems() (Column-index, Series) pairs >>> df3.fillna(df3.mean()) Fill NaN values with a predetermined value
>>> df.iterrows() (Row-index, Series) pairs >>> df2.replace("a", "f") Replace values with others
DataCamp
Learn Python for Data Science Interactively
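A minimal sketch reconstructing the data1/data2 frames used in the Combining Data panel above (values read off the tables shown there):
>>> import numpy as np
>>> import pandas as pd
>>> data1 = pd.DataFrame({'X1': ['a','b','c'], 'X2': [11.432, 1.303, 99.906]})
>>> data2 = pd.DataFrame({'X1': ['a','b','d'], 'X3': [20.784, np.nan, 20.784]})
>>> pd.merge(data1, data2, how='left', on='X1')    Keep every row of data1
>>> pd.merge(data1, data2, how='outer', on='X1')   Keep the union of X1 keys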
Python For Data Science Cheat Sheet Asking For Help Dropping
>>> help(pd.Series.loc)
>>> s.drop(['a', 'c']) Drop values from rows (axis=0)
Pandas Basics Selection Also see NumPy Arrays >>> df.drop('Country', axis=1) Drop values from columns(axis=1)
Learn Python for Data Science Interactively at www.DataCamp.com
>>> s['b'] Get one element Sort & Rank
-5
Pandas >>> df.sort_index() Sort by labels along an axis
>>> df.sort_values(by='Country') Sort by the values along an axis
>>> df[1:] Get subset of a DataFrame
The Pandas library is built on NumPy and provides easy-to-use Country Capital Population >>> df.rank() Assign ranks to entries
data structures and data analysis tools for the Python 1 India New Delhi 1303171035
2 Brazil Brasília 207847528
programming language. Retrieving Series/DataFrame Information
Basic Information
Use the following import convention: By Position >>> df.shape (rows,columns)
>>> import pandas as pd >>> df.iloc[[0],[0]] Select single value by row & >>> df.index Describe index
'Belgium' column >>> df.columns Describe DataFrame columns
Pandas Data Structures >>> df.iat[0,0]
Series 'Belgium'
>>> df.info() Info on DataFrame
>>> df.count() Number of non-NA values
Summary
A one-dimensional labeled array a 3 By Label
>>> df.loc[[0], ['Country']] Select single value by row & >>> df.sum() Sum of values
capable of holding any data type b -5 'Belgium' column labels >>> df.cumsum() Cummulative sum of values
>>> df.min()/df.max() Minimum/maximum values
Index c 7 >>> df.at[0,'Country'] >>> df.idxmin()/df.idxmax() Minimum/Maximum index value
d 4 'Belgium' >>> df.describe() Summary statistics
>>> df.mean() Mean of values
>>> s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
By Label/Position >>> df.median() Median of values
>>> df.ix[2] Select single row of
DataFrame
Country        Brazil
Capital        Brasília
Population  207847528
subset of rows Applying Functions
>>> f = lambda x: x*2
Columns
Country Capital Population A two-dimensional labeled >>> df.ix[:,'Capital'] Select a single column of >>> df.apply(f) Apply function
>>> df.applymap(f) Apply function element-wise
data structure with columns 0 Brussels subset of columns
0 Belgium Brussels 11190846 1 New Delhi
of potentially different types 2 Brasília Data Alignment
1 India New Delhi 1303171035
Index >>> df.ix[1,'Capital'] Select rows and columns
2 Brazil Brasília 207847528 Internal Data Alignment
'New Delhi'
NA values are introduced in the indices that don’t overlap:
Boolean Indexing
>>> data = {'Country': ['Belgium', 'India', 'Brazil'], >>> s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
>>> s[~(s > 1)] Series s where value is not >1
'Capital': ['Brussels', 'New Delhi', 'Brasília'], >>> s[(s < -1) | (s > 2)] s where value is <-1 or >2 >>> s + s3
'Population': [11190846, 1303171035, 207847528]} >>> df[df['Population']>1200000000] Use filter to adjust DataFrame a 10.0
b NaN
>>> df = pd.DataFrame(data,
c 5.0
columns=['Country', 'Capital', 'Population']) >>> s['a'] = 6 Set index a of Series s to 6
d 7.0
I/O Arithmetic Operations with Fill Methods
You can also do the internal data alignment yourself with
Read and Write to CSV Read and Write to SQL Query or Database Table
the help of the fill methods:
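The calls themselves were lost in the layout here; a minimal sketch of the standard pandas fill-method pattern (the first line reproduces the a/b/c/d output listed below):
>>> s.add(s3, fill_value=0) Add, treating missing labels as 0
>>> s.sub(s3, fill_value=2) Subtract with a fill value
>>> s.div(s3, fill_value=4) Divide with a fill value
>>> s.mul(s3, fill_value=3) Multiply with a fill value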
>>> pd.read_csv( , header=None, nrows=5) >>> from sqlalchemy import create_engine
>>> df.to_csv('myDataFrame.csv') >>> engine = create_engine('sqlite:///:memory:') a 10.0
>>> pd.read_sql("SELECT * FROM my_table;", engine) b -5.0
Read and Write to Excel c 5.0
>>> pd.read_sql_table('my_table', engine) d 7.0
>>> pd.read_excel( ) >>> pd.read_sql_query("SELECT * FROM my_table;", engine)
>>> df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1')
read_sql()is a convenience wrapper around read_sql_table() and
Read multiple sheets from the same file
read_sql_query()
>>> xlsx = pd.ExcelFile( )
>>> df = pd.read_excel(xlsx, 'Sheet1') >>> df.to_sql('myDf', engine) DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Retrieving RDD Information Reshaping Data
Basic Information Reducing
PySpark - RDD Basics >>> rdd.getNumPartitions() List the number of partitions
>>> rdd.reduceByKey(lambda x,y : x+y)
.collect()
Merge the rdd values for
each key
Learn Python for data science Interactively at www.DataCamp.com >>> rdd.count() Count RDD instances [('a',9),('b',2)]
3 >>> rdd.reduce(lambda a, b: a + b) Merge the rdd values
>>> rdd.countByKey() Count RDD instances by key ('a',7,'a',2,'b',2)
defaultdict(<type 'int'>,{'a':2,'b':1}) Grouping by
>>> rdd.countByValue() Count RDD instances by value >>> rdd3.groupBy(lambda x: x % 2) Return RDD of grouped values
Spark defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap() Return (key,value) pairs as a
.mapValues(list)
.collect()
PySpark is the Spark Python API that exposes {'a': 2,'b': 2} dictionary >>> rdd.groupByKey() Group rdd by key
>>> rdd3.sum() Sum of RDD elements .mapValues(list)
the Spark programming model to Python. 4950 .collect()
>>> sc.parallelize([]).isEmpty() Check whether RDD is empty [('a',[7,2]),('b',[2])]
True
Initializing Spark Summary
Aggregating
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y:(x[0]+y[0],x[1]+y[1]))
SparkContext >>> rdd3.max() Maximum value of RDD elements >>> rdd3.aggregate((0,0),seqOp,combOp) Aggregate RDD elements of each
99 (4950,100) partition and then the results
>>> from pyspark import SparkContext >>> rdd3.min() Minimum value of RDD elements
>>> sc = SparkContext(master = 'local[2]') >>> rdd.aggregateByKey((0,0),seqop,combop) Aggregate values of each RDD key
0
>>> rdd3.mean() Mean value of RDD elements .collect()
Inspect SparkContext 49.5 [('a',(9,2)), ('b',(2,1))]
>>> rdd3.stdev() Standard deviation of RDD elements >>> rdd3.fold(0,add) Aggregate the elements of each
>>> sc.version Retrieve SparkContext version 28.866070047722118 4950 partition, and then the results
>>> sc.pythonVer Retrieve Python version >>> rdd3.variance() Compute variance of RDD elements >>> rdd.foldByKey(0, add) Merge the values for each key
>>> sc.master Master URL to connect to 833.25 .collect()
>>> str(sc.sparkHome) Path where Spark is installed on worker nodes >>> rdd3.histogram(3) Compute histogram by bins [('a',9),('b',2)]
>>> str(sc.sparkUser()) Retrieve name of the Spark User running ([0,33,66,99],[33,33,34])
>>> rdd3.stats() Summary statistics (count, mean, stdev, max & >>> rdd3.keyBy(lambda x: x+x) Create tuples of RDD elements by
SparkContext
>>> sc.appName Return application name min) .collect() applying a function
>>> sc.applicationId Retrieve application ID
>>> sc.defaultParallelism Return default level of parallelism
>>> sc.defaultMinPartitions Default minimum number of partitions for RDDs
Applying Functions Mathematical Operations
>>> rdd.map(lambda x: x+(x[1],x[0])) Apply a function to each RDD element >>> rdd.subtract(rdd2) Return each rdd value not contained
.collect() .collect() in rdd2
Configuration [('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')] [('b',2),('a',7)]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0])) Apply a function to each RDD element >>> rdd2.subtractByKey(rdd) Return each (key,value) pair of rdd2
>>> from pyspark import SparkConf, SparkContext and flatten the result .collect() with no matching key in rdd
>>> conf = (SparkConf() >>> rdd5.collect() [('d', 1)]
.setMaster("local") ['a',7,7,'a','a',2,2,'a','b',2,2,'b'] >>> rdd.cartesian(rdd2).collect() Return the Cartesian product of rdd
.setAppName("My app") >>> rdd4.flatMapValues(lambda x: x) Apply a flatMap function to each (key,value) and rdd2
.set("spark.executor.memory", "1g")) .collect() pair of rdd4 without changing the keys
>>> sc = SparkContext(conf = conf) [('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]
Sort
Using The Shell Selecting Data >>> rdd2.sortBy(lambda x: x[1]) Sort RDD by given function
.collect()
In the PySpark shell, a special interpreter-aware SparkContext is already Getting [('d',1),('b',1),('a',2)]
created in the variable called sc. >>> rdd.collect() Return a list with all RDD elements >>> rdd2.sortByKey() Sort (key, value) RDD by key
[('a', 7), ('a', 2), ('b', 2)] .collect()
$ ./bin/spark-shell --master local[2] >>> rdd.take(2) Take first 2 RDD elements [('a',2),('b',1),('d',1)]
$ ./bin/pyspark --master local[4] --py-files code.py [('a', 7), ('a', 2)]
>>> rdd.first() Take first RDD element
Set which master the context connects to with the --master argument, and
('a', 7) Repartitioning
add Python .zip, .egg or .py files to the runtime path by passing a >>> rdd.top(2) Take top 2 RDD elements
[('b', 2), ('a', 7)] >>> rdd.repartition(4) New RDD with 4 partitions
comma-separated list to --py-files. >>> rdd.coalesce(1) Decrease the number of partitions in the RDD to 1
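A minimal sketch (reusing the code.py file name from the example above): dependencies can also be shipped to the executors at runtime from an existing SparkContext.
>>> sc.addPyFile("code.py") Distribute a .py, .zip or .egg file to every worker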
Sampling
Loading Data >>> rdd3.sample(False, 0.15, 81).collect() Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97] Saving
Parallelized Collections Filtering >>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.filter(lambda x: "a" in x) Filter the RDD >>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)]) .collect() 'org.apache.hadoop.mapred.TextOutputFormat')
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)]) [('a',7),('a',2)]
>>> rdd3 = sc.parallelize(range(100)) >>> rdd5.distinct().collect() Return distinct RDD values
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
("b",["p", "r"])])
['a',2,'b',7]
>>> rdd.keys().collect() Return (key,value) RDD's keys
Stopping SparkContext
['a', 'a', 'b'] >>> sc.stop()
External Data
Read either one text file from HDFS, a local file system or any Iterating Execution
Hadoop-supported file system URI with textFile(), or read in a directory >>> def g(x): print(x)
>>> rdd.foreach(g) Apply a function to all RDD elements $ ./bin/spark-submit examples/src/main/python/pi.py
of text files with wholeTextFiles().
('a', 7)
>>> textFile = sc.textFile("/my/directory/*.txt") ('b', 2) DataCamp
>>> textFile2 = sc.wholeTextFiles("/my/directory/") ('a', 2) Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Duplicate Values GroupBy
>>> df = df.dropDuplicates() >>> df.groupBy("age")\ Group by age, count the members
PySpark - SQL Basics .count() \
.show()
in the groups
Learn Python for data science Interactively at www.DataCamp.com Queries
>>> from pyspark.sql import functions as F
Select Filter
>>> df.select("firstName").show() Show all entries in firstName column >>> df.filter(df["age"]>24).show() Filter entries of age, only keep those
>>> df.select("firstName","lastName") \ records of which the values are >24
PySpark & Spark SQL .show()
>>> df.select("firstName", Show all entries in firstName, age
Spark SQL is Apache Spark's module for "age", and type
Sort
explode("phoneNumber") \
working with structured data. .alias("contactInfo")) \
.select("contactInfo.type", >>> peopledf.sort(peopledf.age.desc()).collect()
>>> df.sort("age", ascending=False).collect()
Initializing SparkSession "firstName",
"age") \ >>> df.orderBy(["age","city"],ascending=[0,1])\
A SparkSession can be used to create DataFrames, register DataFrames as tables, .show() .collect()
execute SQL over tables, cache tables, and read parquet files. >>> df.select(df["firstName"],df["age"]+ 1) Show all entries in firstName and age,
.show() add 1 to the entries of age
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
>>> df.select(df['age'] > 24).show()
When
Show all entries where age >24 Missing & Replacing Values
.builder \ >>> df.select("firstName", Show firstName and 0 or 1 depending
.appName("Python Spark SQL basic example") \ >>> df.na.fill(50).show() Replace null values
F.when(df.age > 30, 1) \ on age >30 >>> df.na.drop().show() Return new df omitting rows with null values
.config("spark.some.config.option", "some-value") \ .otherwise(0)) \
.getOrCreate() >>> df.na \ Return new df replacing one value with
.show() .replace(10, 20) \ another
>>> df[df.firstName.isin("Jane","Boris")] Show firstName if in the given options .show()
Creating DataFrames Like
.collect()
From RDDs
>>> df.select("firstName", Show firstName, and lastName is
df.lastName.like("Smith")) \ TRUE if lastName is like Smith
Repartitioning
.show()
>>> from pyspark.sql.types import * Startswith - Endswith >>> df.repartition(10)\ df with 10 partitions
>>> df.select("firstName", Show firstName, and TRUE if .rdd \
Infer Schema .getNumPartitions()
>>> sc = spark.sparkContext df.lastName \ lastName starts with Sm
.startswith("Sm")) \ >>> df.coalesce(1).rdd.getNumPartitions() df with 1 partition
>>> lines = sc.textFile("people.txt")
.show()
>>> parts = lines.map(lambda l: l.split(",")) >>> df.select(df.lastName.endswith("th")) \ Show last names ending in th
>>>
>>>
people = parts.map(lambda p: Row(name=p[0],age=int(p[1])))
peopledf = spark.createDataFrame(people)
.show() Running SQL Queries Programmatically
Substring
Specify Schema >>> df.select(df.firstName.substr(1, 3) \ Return substrings of firstName Registering DataFrames as Views
>>> people = parts.map(lambda p: Row(name=p[0], .alias("name")) \
age=int(p[1].strip()))) .collect() >>> peopledf.createGlobalTempView("people")
>>> schemaString = "name age" Between >>> df.createTempView("customer")
>>> fields = [StructField(field_name, StringType(), True) for >>> df.select(df.age.between(22, 24)) \ Show age: values are TRUE if between >>> df.createOrReplaceTempView("customer")
field_name in schemaString.split()] .show() 22 and 24
>>> schema = StructType(fields) Query Views
>>> spark.createDataFrame(people, schema).show()
+--------+---+
| name|age|
Add, Update & Remove Columns >>> df5 = spark.sql("SELECT * FROM customer").show()
+--------+---+ >>> peopledf2 = spark.sql("SELECT * FROM global_temp.people")\
|
|
Mine| 28|
Filip| 29|
Adding Columns .show()
|Jonathan| 30|
+--------+---+ >>> df = df.withColumn('city',df.address.city) \
.withColumn('postalCode',df.address.postalCode) \
From Spark Data Sources .withColumn('state',df.address.state) \
.withColumn('streetAddress',df.address.streetAddress) \
Output
.withColumn('telePhoneNumber', Data Structures
JSON explode(df.phoneNumber.number)) \
>>> df = spark.read.json("customer.json") .withColumn('telePhoneType',
>>> df.show() >>> rdd1 = df.rdd Convert df into an RDD
+--------------------+---+---------+--------+--------------------+ explode(df.phoneNumber.type)) >>> df.toJSON().first() Convert df into a RDD of string
| address|age|firstName |lastName| phoneNumber|
+--------------------+---+---------+--------+--------------------+ >>> df.toPandas() Return the contents of df as Pandas
|[New York,10021,N...| 25|
|[New York,10021,N...| 21|
John|
Jane|
Smith|[[212 555-1234,ho...|
Doe|[[322 888-1234,ho...|
Updating Columns DataFrame
+--------------------+---+---------+--------+--------------------+
>>> df2 = spark.read.load("people.json", format="json")
>>> df = df.withColumnRenamed('telePhoneNumber', 'phoneNumber') Write & Save to Files
Parquet files Removing Columns >>> df.select("firstName", "city")\
>>> df3 = spark.read.load("users.parquet") .write \
TXT files >>> df = df.drop("address", "phoneNumber") .save("nameAndCity.parquet")
>>> df4 = spark.read.text("people.txt") >>> df = df.drop(df.address).drop(df.phoneNumber) >>> df.select("firstName", "age") \
.write \
.save("namesAndAges.json",format="json")
Inspect Data
>>> df.dtypes Return df column names and data types >>> df.describe().show() Compute summary statistics Stopping SparkSession
>>> df.show() Display the content of df >>> df.columns Return the columns of df
>>> df.count() Count the number of rows in df >>> spark.stop()
>>> df.head() Return first n rows
>>> df.first() Return first row >>> df.distinct().count() Count the number of distinct rows in df
>>> df.take(2) Return the first n rows >>> df.printSchema() Print the schema of df DataCamp
>>> df.schema Return the schema of df >>> df.explain() Print the (logical and physical) plans
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Lists Also see NumPy Arrays Libraries
>>> a = 'is' Import libraries
Python Basics >>> b = 'nice' >>> import numpy Data analysis Machine learning
Learn More Python for Data Science Interactively at www.datacamp.com >>> my_list = ['my', 'list', a, b] >>> import numpy as np
>>> my_list2 = [[4,5,6,7], [3,4,5,6]] Selective import
>>> from math import pi Scientific computing 2D plotting
Variables and Data Types Selecting List Elements Index starts at 0
Subset Install Python
Variable Assignment
>>> my_list[1] Select item at index 1
>>> x=5
>>> my_list[-3] Select 3rd last item
>>> x
Slice
5 >>> my_list[1:3] Select items at index 1 and 2
Calculations With Variables >>> my_list[1:] Select items after index 0
>>> my_list[:3] Select items before index 3 Leading open data science platform Free IDE that is included Create and share
>>> x+2 Sum of two variables
>>> my_list[:] Copy my_list powered by Python with Anaconda documents with live code,
7 visualizations, text, ...
>>> x-2 Subtraction of two variables
Subset Lists of Lists
>>> my_list2[1][0] my_list[list][itemOfList]
3
>>> my_list2[1][:2] Numpy Arrays Also see Lists
>>> x*2 Multiplication of two variables
>>> my_list = [1, 2, 3, 4]
10 List Operations >>> my_array = np.array(my_list)
>>> x**2 Exponentiation of a variable
25 >>> my_list + my_list >>> my_2darray = np.array([[1,2,3],[4,5,6]])
>>> x%2 Remainder of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice']
Selecting Numpy Array Elements Index starts at 0
1 >>> my_list * 2
>>> x/float(2) Division of a variable ['my', 'list', 'is', 'nice', 'my', 'list', 'is', 'nice'] Subset
2.5 >>> my_list2 > 4 >>> my_array[1] Select item at index 1
True 2
Types and Type Conversion Slice
List Methods >>> my_array[0:2] Select items at index 0 and 1
str() '5', '3.45', 'True' Variables to strings
>>> my_list.index(a) Get the index of an item array([1, 2])
int() 5, 3, 1 Variables to integers >>> my_list.count(a) Count an item Subset 2D Numpy arrays
>>> my_list.append('!') Append an item at a time >>> my_2darray[:,0] my_2darray[rows, columns]
my_list.remove('!') Remove an item array([1, 4])
float() 5.0, 1.0 Variables to floats >>>
>>> del(my_list[0:1]) Remove an item Numpy Array Operations
bool() True, True, True >>> my_list.reverse() Reverse the list
Variables to booleans >>> my_array > 3
>>> my_list.extend('!') Append an item array([False, False, False, True], dtype=bool)
>>> my_list.pop(-1) Remove an item >>> my_array * 2
Asking For Help >>> my_list.insert(0,'!') Insert an item array([2, 4, 6, 8])
>>> help(str) >>> my_list.sort() Sort the list >>> my_array + np.array([5, 6, 7, 8])
array([6, 8, 10, 12])
Strings
>>> my_string = 'thisStringIsAwesome' Numpy Array Functions
String Operations Index starts at 0
>>> my_string >>> my_array.shape Get the dimensions of the array
'thisStringIsAwesome' >>> my_string[3] >>> np.append(my_array, other_array) Append items to an array
>>> my_string[4:9] >>> np.insert(my_array, 1, 5) Insert items in an array
String Operations >>> np.delete(my_array,[1]) Delete items in an array
String Methods >>> np.mean(my_array) Mean of the array
>>> my_string * 2
'thisStringIsAwesomethisStringIsAwesome' >>> my_string.upper() String to uppercase >>> np.median(my_array) Median of the array
>>> my_string + 'Innit' >>> my_string.lower() String to lowercase >>> np.corrcoef(my_array) Correlation coefficient
'thisStringIsAwesomeInnit' >>> my_string.count('w') Count String elements >>> np.std(my_array) Standard deviation
>>> 'm' in my_string >>> my_string.replace('e', 'i') Replace String elements
True >>> my_string.strip() Strip whitespaces DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Create Your Model Evaluate Your Model’s Performance
Supervised Learning Estimators Classification Metrics
Scikit-Learn
Learn Python for data science Interactively at www.DataCamp.com Linear Regression Accuracy Score
>>> from sklearn.linear_model import LinearRegression >>> knn.score(X_test, y_test) Estimator score method
>>> lr = LinearRegression(normalize=True) >>> from sklearn.metrics import accuracy_score Metric scoring functions
>>> accuracy_score(y_test, y_pred)
Support Vector Machines (SVM)
Scikit-learn >>> from sklearn.svm import SVC Classification Report
>>> svc = SVC(kernel='linear') >>> from sklearn.metrics import classification_report Precision, recall, f1-score
Scikit-learn is an open source Python library that Naive Bayes >>> print(classification_report(y_test, y_pred)) and support
implements a range of machine learning, >>> from sklearn.naive_bayes import GaussianNB Confusion Matrix
>>> gnb = GaussianNB() >>> from sklearn.metrics import confusion_matrix
preprocessing, cross-validation and visualization >>> print(confusion_matrix(y_test, y_pred))
algorithms using a unified interface. KNN
>>> from sklearn import neighbors Regression Metrics
A Basic Example >>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> from sklearn import neighbors, datasets, preprocessing
Mean Absolute Error
>>> from sklearn.model_selection import train_test_split Unsupervised Learning Estimators >>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.metrics import accuracy_score >>> y_true = [3, -0.5, 2]
>>> iris = datasets.load_iris() Principal Component Analysis (PCA) >>> mean_absolute_error(y_true, y_pred)
>>> X, y = iris.data[:, :2], iris.target >>> from sklearn.decomposition import PCA Mean Squared Error
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33) >>> pca = PCA(n_components=0.95) >>> from sklearn.metrics import mean_squared_error
>>> scaler = preprocessing.StandardScaler().fit(X_train) >>> mean_squared_error(y_test, y_pred)
>>> X_train = scaler.transform(X_train)
K Means
>>> X_test = scaler.transform(X_test) >>> from sklearn.cluster import KMeans R² Score
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5) >>> k_means = KMeans(n_clusters=3, random_state=0) >>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred) Model Fitting Clustering Metrics
Adjusted Rand Index
Supervised learning >>> from sklearn.metrics import adjusted_rand_score
Loading The Data Also see NumPy & Pandas >>> lr.fit(X, y) Fit the model to the data
>>> adjusted_rand_score(y_true, y_pred)
>>> knn.fit(X_train, y_train)
Your data needs to be numeric and stored as NumPy arrays or SciPy sparse >>> svc.fit(X_train, y_train) Homogeneity
>>> from sklearn.metrics import homogeneity_score
matrices. Other types that are convertible to numeric arrays, such as Pandas Unsupervised Learning >>> homogeneity_score(y_true, y_pred)
DataFrame, are also acceptable. >>> k_means.fit(X_train) Fit the model to the data
>>> pca_model = pca.fit_transform(X_train) Fit to data, then transform it V-measure
>>> import numpy as np >>> from sklearn.metrics import v_measure_score
>>> X = np.random.random((10,5)) >>> metrics.v_measure_score(y_true, y_pred)
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])
>>> X[X < 0.7] = 0 Prediction Cross-Validation
>>> from sklearn.model_selection import cross_val_score
Supervised Estimators >>> print(cross_val_score(knn, X_train, y_train, cv=4))
Training And Test Data >>> y_pred = svc.predict(np.random.random((2,5))) Predict labels
>>> y_pred = lr.predict(X_test)
>>> print(cross_val_score(lr, X, y, cv=2))
Predict labels
>>> from sklearn.model_selection import train_test_split >>> y_pred = knn.predict_proba(X_test) Estimate probability of a label
>>> X_train, X_test, y_train, y_test = train_test_split(X,
y, Unsupervised Estimators Tune Your Model
random_state=0) >>> y_pred = k_means.predict(X_test) Predict labels in clustering algos Grid Search
>>> from sklearn.model_selection import GridSearchCV
>>> params = {"n_neighbors": np.arange(1,3),
Preprocessing The Data "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn,
Standardization Encoding Categorical Features param_grid=params)
>>> grid.fit(X_train, y_train)
>>> from sklearn.preprocessing import StandardScaler >>> from sklearn.preprocessing import LabelEncoder >>> print(grid.best_score_)
>>> scaler = StandardScaler().fit(X_train) >>> print(grid.best_estimator_.n_neighbors)
>>> enc = LabelEncoder()
>>> standardized_X = scaler.transform(X_train) >>> y = enc.fit_transform(y)
>>> standardized_X_test = scaler.transform(X_test) Randomized Parameter Optimization
Normalization Imputing Missing Values >>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5),
>>> from sklearn.preprocessing import Normalizer "weights": ["uniform", "distance"]}
>>> from sklearn.preprocessing import Imputer >>> rsearch = RandomizedSearchCV(estimator=knn,
>>> scaler = Normalizer().fit(X_train) >>> imp = Imputer(missing_values=0, strategy='mean', axis=0) param_distributions=params,
>>> normalized_X = scaler.transform(X_train) >>> imp.fit_transform(X_train) cv=4,
>>> normalized_X_test = scaler.transform(X_test) n_iter=8,
random_state=5)
Binarization Generating Polynomial Features >>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
>>> from sklearn.preprocessing import Binarizer >>> from sklearn.preprocessing import PolynomialFeatures
>>> binarizer = Binarizer(threshold=0.0).fit(X) >>> poly = PolynomialFeatures(5)
>>> binary_X = binarizer.transform(X) >>> poly.fit_transform(X) DataCamp
Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet Linear Algebra Also see NumPy
You’ll use the linalg and sparse modules. Note that scipy.linalg contains and expands on numpy.linalg.
SciPy - Linear Algebra >>> from scipy import linalg, sparse Matrix Functions
Learn More Python for Data Science Interactively at www.datacamp.com
Creating Matrices Addition
>>> np.add(A,D) Addition
>>> A = np.matrix(np.random.random((2,2)))
SciPy >>> B = np.asmatrix(b) Subtraction
>>> C = np.mat(np.random.random((10,5))) >>> np.subtract(A,D) Subtraction
The SciPy library is one of the core packages for >>> D = np.mat([[3,4], [5,6]]) Division
scientific computing that provides mathematical >>> np.divide(A,D) Division
Basic Matrix Routines Multiplication
algorithms and convenience functions built on the
>>> np.multiply(D,A) Multiplication
NumPy extension of Python. Inverse >>> np.dot(A,D) Dot product
>>> A.I Inverse >>> np.vdot(A,D) Vector dot product
>>> linalg.inv(A) Inverse
Interacting With NumPy Also see NumPy >>> A.T Tranpose matrix >>> np.inner(A,D) Inner product
>>> np.outer(A,D) Outer product
>>> import numpy as np >>> A.H Conjugate transposition >>> np.tensordot(A,D) Tensor dot product
>>> a = np.array([1,2,3]) >>> np.trace(A) Trace >>> np.kron(A,D) Kronecker product
>>> b = np.array([(1+5j,2j,3j), (4j,5j,6j)])
>>> c = np.array([[(1.5,2,3), (4,5,6)], [(3,2,1), (4,5,6)]]) Norm Exponential Functions
>>> linalg.norm(A) Frobenius norm >>> linalg.expm(A) Matrix exponential
Index Tricks >>> linalg.norm(A,1) L1 norm (max column sum) >>> linalg.expm2(A) Matrix exponential (Taylor Series)
>>> linalg.norm(A,np.inf) L inf norm (max row sum) >>> linalg.expm3(D) Matrix exponential (eigenvalue
>>> np.mgrid[0:5,0:5] Create a dense meshgrid decomposition)
>>> np.ogrid[0:2,0:2] Create an open meshgrid Rank Logarithm Function
>>> np.r_[[3,[0]*5,-1:1:10j] Stack arrays vertically (row-wise) >>> np.linalg.matrix_rank(C) Matrix rank >>> linalg.logm(A) Matrix logarithm
>>> np.c_[b,c] Create stacked column-wise arrays Determinant Trigonometric Tunctions
>>> linalg.det(A) Determinant >>> linalg.sinm(D) Matrix sine
Shape Manipulation Solving linear problems >>> linalg.cosm(D) Matrix cosine
>>> np.transpose(b) Permute array dimensions >>> linalg.solve(A,b) Solver for dense matrices >>> linalg.tanm(A) Matrix tangent
>>> b.flatten() Flatten the array >>> E = np.mat(a).T Create a column-vector matrix from a Hyperbolic Trigonometric Functions
>>> np.hstack((b,c)) Stack arrays horizontally (column-wise) >>> linalg.lstsq(D,E) Least-squares solution to linear matrix >>> linalg.sinhm(D) Hypberbolic matrix sine
>>> np.vstack((a,b)) Stack arrays vertically (row-wise) equation >>> linalg.coshm(D) Hyperbolic matrix cosine
>>> np.hsplit(c,2) Split the array horizontally at the 2nd index Generalized inverse >>> linalg.tanhm(A) Hyperbolic matrix tangent
>>> np.vsplit(d,2) Split the array vertically at the 2nd index Generalized inverse >>> linalg.tanhm(A) Hyperbolic matrix tangent
(least-squares solver) >>> linalg.signm(A) Matrix sign function
Polynomials >>> linalg.pinv2(C) Compute the pseudo-inverse of a matrix
>>> from numpy import poly1d (SVD) Matrix Square Root
>>> linalg.sqrtm(A) Matrix square root
>>> p = poly1d([3,4,5]) Create a polynomial object
Creating Sparse Matrices Arbitrary Functions
Vectorizing Functions >>> linalg.funm(A, lambda x: x*x) Evaluate matrix function
>>> F = np.eye(3, k=1) Create a 3x3 matrix with ones on the first superdiagonal
>>> def myfunc(a):
if a < 0: >>> G = np.mat(np.identity(2)) Create a 2x2 identity matrix Decompositions
return a*2 >>> C[C > 0.5] = 0
else: >>> H = sparse.csr_matrix(C)
return a/2
Compressed Sparse Row matrix Eigenvalues and Eigenvectors
>>> I = sparse.csc_matrix(D) Compressed Sparse Column matrix >>> la, v = linalg.eig(A) Solve ordinary or generalized
>>> np.vectorize(myfunc) Vectorize functions >>> J = sparse.dok_matrix(A) Dictionary Of Keys matrix eigenvalue problem for square matrix
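A minimal sketch of calling the vectorized function above on an array (input values are illustrative):
>>> vfunc = np.vectorize(myfunc)
>>> vfunc(np.array([-2., -1., 0., 1., 2.]))
array([-4. , -2. , 0. , 0.5, 1. ])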
>>> E.todense() Sparse matrix to full matrix >>> l1, l2 = la Unpack eigenvalues
Type Handling >>> sparse.isspmatrix_csc(A) Identify sparse matrix >>> v[:,0] First eigenvector
>>> v[:,1] Second eigenvector
>>> np.real(c) Return the real part of the array elements
>>> np.imag(c) Return the imaginary part of the array elements Sparse Matrix Routines >>> linalg.eigvals(A) Compute eigenvalues
>>> np.real_if_close(c,tol=1000) Return a real array if complex parts close to 0 Singular Value Decomposition
>>> np.cast['f'](np.pi) Cast object to a data type Inverse >>> U,s,Vh = linalg.svd(B) Singular Value Decomposition (SVD)
>>> sparse.linalg.inv(I) Inverse >>> M,N = B.shape
Other Useful Functions Norm >>> Sig = linalg.diagsvd(s,M,N) Construct sigma matrix in SVD
>>> sparse.linalg.norm(I) Norm LU Decomposition
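As a quick sanity check (a sketch assuming B, U, s, Vh and Sig from the SVD lines above), the factors multiply back to the original matrix:
>>> np.allclose(B, U @ Sig @ Vh) Verify the SVD reconstruction (returns True)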
>>> np.angle(b,deg=True) Return the angle of the complex argument >>> P,L,U = linalg.lu(C) LU Decomposition
>>> g = np.linspace(0,np.pi,num=5) Create an array of evenly spaced values
Solving linear problems
(number of samples) >>> sparse.linalg.spsolve(H,I) Solver for sparse matrices
>>> g[3:] += np.pi
>>> np.unwrap(g) Unwrap Sparse Matrix Decompositions
>>> np.logspace(0,10,3) Create an array of evenly spaced values (log scale) Sparse Matrix Functions
>>> la, v = sparse.linalg.eigs(F,1) Eigenvalues and eigenvectors
>>> np.select([c<4],[c*2]) Return values from a list of arrays depending on >>> sparse.linalg.expm(I) Sparse matrix exponential >>> sparse.linalg.svds(H, 2) SVD
conditions
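A self-contained sparse sketch (hypothetical 3x3 system, independent of the matrices defined elsewhere on this sheet):
>>> from scipy import sparse
>>> from scipy.sparse.linalg import spsolve
>>> K = sparse.csr_matrix(np.eye(3) + np.eye(3, k=1)) Small upper-bidiagonal matrix in CSR format
>>> spsolve(K, np.array([1.0, 2.0, 3.0])) Solve K·x = rhs: array([ 2., -1., 3.])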
>>> misc.factorial(a) Factorial
>>> misc.comb(10,3,exact=True) Combinations of N things taken k at a time
>>> misc.central_diff_weights(3) Weights for Np-point central derivative Asking For Help DataCamp
>>> misc.derivative(myfunc,1.0) Find the n-th derivative of a function at a point >>> help(scipy.linalg.diagsvd)
>>> np.info(np.matrix) Learn Python for Data Science Interactively
Python For Data Science Cheat Sheet 3 Plotting With Seaborn
Seaborn Axis Grids
Learn Data Science Interactively at www.DataCamp.com >>> g = sns.FacetGrid(titanic, Subplot grid for plotting conditional >>> h = sns.PairGrid(iris) Subplot grid for plotting pairwise
col="survived", relationships >>> h = h.map(plt.scatter) relationships
row="sex") >>> sns.pairplot(iris) Plot pairwise bivariate distributions
>>> g = g.map(plt.hist,"age") >>> i = sns.JointGrid(x="x", Grid for bivariate plot with marginal
>>> sns.factorplot(x="pclass", Draw a categorical plot onto a y="y", univariate plots
y="survived", Facetgrid data=data)
Statistical Data Visualization With Seaborn hue="sex",
data=titanic)
>>> i = i.plot(sns.regplot,
sns.distplot)
The Python visualization library Seaborn is based on >>> sns.lmplot(x="sepal_width", Plot data and regression model fits >>> sns.jointplot("sepal_length", Plot bivariate distribution
y="sepal_length", across a FacetGrid "sepal_width",
matplotlib and provides a high-level interface for drawing hue="species", data=iris,
attractive statistical graphics. data=iris) kind='kde')
Categorical Plots Regression Plots
Make use of the following aliases to import the libraries: >>> sns.regplot(x="sepal_width", Plot data and a linear regression
Scatterplot
>>> import matplotlib.pyplot as plt y="sepal_length", model fit
>>> sns.stripplot(x="species", Scatterplot with one
>>> import seaborn as sns data=iris,
y="petal_length", categorical variable
data=iris) ax=ax)
The basic steps to creating plots with Seaborn are: >>> sns.swarmplot(x="species", Categorical scatterplot with Distribution Plots
y="petal_length", non-overlapping points
1. Prepare some data data=iris) >>> plot = sns.distplot(data.y, Plot univariate distribution
2. Control figure aesthetics Bar Chart kde=False,
color="b")
3. Plot with Seaborn >>> sns.barplot(x="sex", Show point estimates and
y="survived", confidence intervals with Matrix Plots
4. Further customize your plot hue="class", scatterplot glyphs
>>> sns.heatmap(uniform_data,vmin=0,vmax=1) Heatmap
data=titanic)
>>> import matplotlib.pyplot as plt Count Plot
>>>
>>>
>>>
import seaborn as sns
tips = sns.load_dataset("tips")
sns.set_style("whitegrid") Step 2
Step 1
>>> sns.countplot(x="deck",
data=titanic,
Show count of observations
4 Further Customizations Also see Matplotlib
palette="Greens_d")
>>> g = sns.lmplot(x="tip", Step 3
Point Plot Axisgrid Objects
y="total_bill",
data=tips, >>> sns.pointplot(x="class", Show point estimates and >>> g.despine(left=True) Remove left spine
aspect=2) y="survived", confidence intervals as >>> g.set_ylabels("Survived") Set the labels of the y-axis
>>> g = (g.set_axis_labels("Tip","Total bill(USD)"). hue="sex", rectangular bars >>> g.set_xticklabels(rotation=45) Set the tick labels for x
set(xlim=(0,10),ylim=(0,100))) data=titanic, >>> g.set_axis_labels("Survived", Set the axis labels
Step 4 palette={"male":"g", "Sex")
>>> plt.title("title")
>>> plt.show(g) Step 5 "female":"m"}, >>> h.set(xlim=(0,5), Set the limit and ticks of the
markers=["^","o"], ylim=(0,5), x-and y-axis
linestyles=["-","--"]) xticks=[0,2.5,5],
Boxplot yticks=[0,2.5,5])
1 Data Also see Lists, NumPy & Pandas >>> sns.boxplot(x="alive", Boxplot
Plot
y="age",
>>> import pandas as pd hue="adult_male",
>>> import numpy as np >>> plt.title("A Title") Add plot title
data=titanic)
>>> uniform_data = np.random.rand(10, 12) >>> plt.ylabel("Survived") Adjust the label of the y-axis
>>> sns.boxplot(data=iris,orient="h") Boxplot with wide-form data
>>> data = pd.DataFrame({'x':np.arange(1,101), >>> plt.xlabel("Sex") Adjust the label of the x-axis
'y':np.random.normal(0,4,100)}) Violinplot >>> plt.ylim(0,100) Adjust the limits of the y-axis
>>> sns.violinplot(x="age", Violin plot >>> plt.xlim(0,10) Adjust the limits of the x-axis
Seaborn also offers built-in data sets: y="sex", >>> plt.setp(ax,yticks=[0,5]) Adjust a plot property
>>> titanic = sns.load_dataset("titanic") hue="survived", >>> plt.tight_layout() Adjust subplot params
>>> iris = sns.load_dataset("iris")
2 Figure Aesthetics Also see Matplotlib
5 Show or Save Plot Also see Matplotlib
>>> plt.show() Show the plot
Context Functions >>> plt.savefig("foo.png") Save the plot as a figure
>>> f, ax = plt.subplots(figsize=(5,6)) Create a figure and one subplot >>> plt.savefig("foo.png", Save transparent figure
>>> sns.set_context("talk") Set context to "talk" transparent=True)
>>> sns.set_context("notebook", Set context to "notebook",
Seaborn styles font_scale=1.5, Scale font elements and
>>> sns.set() (Re)set the seaborn default
rc={"lines.linewidth":2.5}) override param mapping Close & Clear Also see Matplotlib
>>> sns.set_style("whitegrid") Set the matplotlib parameters Color Palette >>> plt.cla() Clear an axis
>>> sns.set_style("ticks", Set the matplotlib parameters >>> plt.clf() Clear an entire figure
{"xtick.major.size":8, >>> sns.set_palette("husl",3) Define the color palette >>> plt.close() Close a window
"ytick.major.size":8}) >>> sns.color_palette("husl") Use with with to temporarily set palette
>>> sns.axes_style("whitegrid") Return a dict of params or use with >>> flatui = ["#9b59b6","#3498db","#95a5a6","#e74c3c","#34495e","#2ecc71"]
with to temporarily set the style >>> sns.set_palette(flatui) Set your own color palette DataCamp
Learn Python for Data Science Interactively
Python for Data Science Cheat Sheet Spans Syntax iterators
Learn more Python for data science interactively at www.datacamp.com
Accessing spans Sentences USUALLY NEEDS THE DEPENDENCY PARSER
Span indices are exclusive. So doc[2:4] is a span starting at doc = nlp("This a sentence. This is another one.")
token 2, up to – but not including! – token 4.
About spaCy # doc.sents is a generator that yields sentence spans
[sent.text for sent in doc.sents]
spaCy is a free, open-source library for advanced Natural doc = nlp("This is a text") # ['This is a sentence.', 'This is another one.']
Language Processing (NLP) in Python. It's designed span = doc[2:4]
specifically for production use and helps you build span.text
applications that process and "understand" large volumes # 'a text'
Base noun phrases NEEDS THE TAGGER AND PARSER
of text. Documentation: spacy.io
doc = nlp("I have a red car")
Creating a span manually
# doc.noun_chunks is a generator that yields spans
[chunk.text for chunk in doc.noun_chunks]
$ pip install spacy # Import the Span object
# ['I', 'a red car']
from spacy.tokens import Span
# Create a Doc object
import spacy
doc = nlp("I live in New York")
# Span for "New York" with label GPE (geopolitical)
span = Span(doc, 3, 5, label="GPE") Label explanations
Statistical models span.text
# 'New York' spacy.explain("RB")
# 'adverb'
Download statistical models spacy.explain("GPE")
Predict part-of-speech tags, dependency labels, named # 'Countries, cities, states'
entities and more. See here for available models: Linguistic features
spacy.io/models
Attributes return label IDs. For string labels, use the
$ python -m spacy download en_core_web_sm attributes with an underscore. For example, token.pos_ . Visualizing
Check that your installed models are up to date If you're in a Jupyter notebook, use displacy.render .
Part-of-speech tags PREDICTED BY STATISTICAL MODEL Otherwise, use displacy.serve to start a web server and
$ python -m spacy validate show the visualization in your browser.
doc = nlp("This is a text.")
Loading statistical models # Coarse-grained part-of-speech tags from spacy import displacy
[token.pos_ for token in doc]
import spacy # ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']
# Load the installed model "en_core_web_sm" # Fine-grained part-of-speech tags Visualize dependencies
nlp = spacy.load("en_core_web_sm") [token.tag_ for token in doc]
# ['DT', 'VBZ', 'DT', 'NN', '.'] doc = nlp("This is a sentence")
displacy.render(doc, style="dep")
Documents and tokens Syntactic dependencies PREDICTED BY STATISTICAL MODEL
doc = nlp("This is a text.")
Processing text # Dependency labels
Processing text with the nlp object returns a Doc object [token.dep_ for token in doc]
that holds all information about the tokens, their linguistic # ['nsubj', 'ROOT', 'det', 'attr', 'punct']
features and their relationships. # Syntactic head token (governor)
[token.head.text for token in doc]
# ['is', 'is', 'text', 'is', 'is']
doc = nlp("This is a text")
Visualize named entities
Accessing token attributes Named entities PREDICTED BY STATISTICAL MODEL
doc = nlp("Larry Page founded Google")
doc = nlp("This is a text") doc = nlp("Larry Page founded Google") displacy.render(doc, style="ent")
# Token texts # Text and label of named entity span
[token.text for token in doc] [(ent.text, ent.label_) for ent in doc.ents]
# ['This', 'is', 'a', 'text'] # [('Larry Page', 'PERSON'), ('Google', 'ORG')]
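Outside a notebook, a minimal sketch of displacy.serve (assuming nlp is the loaded model from above; the port argument is optional and defaults to 5000):

from spacy import displacy
doc = nlp("Larry Page founded Google")
# Start a local web server; open http://localhost:5000 to view the visualization
displacy.serve(doc, style="ent", port=5000)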
Word vectors and similarity Extension attributes Rule-based matching
To use word vectors, you need to install the larger models Custom attributes that are registered on the global Doc ,
ending in md or lg , for example en_core_web_lg . Token and Span classes and become available as ._ . Token patterns
# "love cats", "loving cats", "loved cats"
Comparing similarity
from spacy.tokens import Doc, Token, Span pattern1 = [{"LEMMA": "love"}, {"LOWER": "cats"}]
doc1 = nlp("I like cats") doc = nlp("The sky over New York is blue") # "10 people", "twenty people"
doc2 = nlp("I like dogs") pattern2 = [{"LIKE_NUM": True}, {"TEXT": "people"}]
# Compare 2 documents # "book", "a cat", "the sea" (noun + optional article)
doc1.similarity(doc2) Attribute extensions WITH DEFAULT VALUE pattern3 = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
# Compare 2 tokens
# Register custom attribute on Token class
doc1[2].similarity(doc2[2])
# Compare tokens and spans
Token.set_extension("is_color", default=False) Operators and quantifiers
# Overwrite extension attribute with default value
doc1[0].similarity(doc2[1:3])
doc[6]._.is_color = True Can be added to a token dict as the "OP" key (see the sketch after this table).
! Negate pattern and match exactly 0 times.
Property extensions WITH GETTER & SETTER
? Make pattern optional and match 0 or 1 times.
# Vector as a numpy array # Register custom attribute on Doc class
doc = nlp("I like cats") get_reversed = lambda doc: doc.text[::-1] + Require pattern to match 1 or more times.
# The L2 norm of the token's vector Doc.set_extension("reversed", getter=get_reversed)
doc[2].vector * Allow pattern to match 0 or more times.
# Compute value of extension attribute with getter
doc[2].vector_norm doc._.reversed
# 'eulb si kroY weN revo yks ehT'
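A minimal sketch combining the OP quantifiers from the table above (hypothetical pattern; the exact matches depend on the statistical tagger):

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# Zero or more "very" tokens followed by an adjective
pattern = [{"LOWER": "very", "OP": "*"}, {"POS": "ADJ"}]
matcher.add("VERY_ADJ", None, pattern)
doc = nlp("This model is very very good")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
# e.g. 'good', 'very good', 'very very good'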
Glossary
Method extensions CALLABLE METHOD Tokenization Segmenting text into words, punctuation etc.
Pipeline components
Lemmatization Assigning the base forms of words, for example:
Functions that take a Doc object, modify it and return it. # Register custom attribute on Span class
"was" → "be" or "rats" → "rat".
has_label = lambda span, label: span.label_ == label
Span.set_extension("has_label", method=has_label) Sentence Boundary Finding and segmenting individual sentences.
nlp
# Compute value of extension attribute with method
Text tokenizer tagger parser ner ... Doc doc[3:5].has_label("GPE") Part-of-speech (POS) Assigning word types to tokens like verb or noun.
# True Tagging
Dependency Parsing Assigning syntactic dependency labels,
describing the relations between individual
Pipeline information tokens, like subject or object.
Rule-based matching
nlp = spacy.load("en_core_web_sm") Named Entity Labeling named "real-world" objects, like
nlp.pipe_names Recognition (NER) persons, companies or locations.
# ['tagger', 'parser', 'ner'] Using the matcher Text Classification Assigning categories or labels to a whole
nlp.pipeline
document, or parts of a document.
# [('tagger', <spacy.pipeline.Tagger>), # Matcher is initialized with the shared vocab
# ('parser', <spacy.pipeline.DependencyParser>), from spacy.matcher import Matcher Statistical model Process for making predictions based on
# ('ner', <spacy.pipeline.EntityRecognizer>)] # Each dict represents one token and its attributes examples.
matcher = Matcher(nlp.vocab) Training Updating a statistical model with new examples.
# Add with ID, optional callback and pattern(s)
Custom components
pattern = [{"LOWER": "new"}, {"LOWER": "york"}]
# Function that modifies the doc and returns it matcher.add("CITIES", None, pattern)
def custom_component(doc): # Match by calling the matcher on a Doc object
print("Do something to the doc here!") doc = nlp("I live in New York")
return doc matches = matcher(doc)
# Add the component first in the pipeline # Matches are (match_id, start, end) tuples
nlp.add_pipe(custom_component, first=True) for match_id, start, end in matches: Learn Python for
# Get the matched span by slicing the Doc data science interactively at
span = doc[start:end]
Components can be added first , last (default), or www.datacamp.com
print(span.text)
before or after an existing component. # 'New York'
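A hedged sketch of these placement options (spaCy v2-style API as used on this sheet; the name values are hypothetical and only needed when adding the same function more than once):

# Each call demonstrates one placement option
nlp.add_pipe(custom_component, name="c_first", first=True) # run before all other components
nlp.add_pipe(custom_component, name="c_before", before="ner") # run just before the entity recognizer
nlp.add_pipe(custom_component, name="c_after", after="tagger") # run just after the tagger
nlp.add_pipe(custom_component, name="c_last", last=True) # run at the end (the default)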
R For Data Science Cheat Sheet dplyr ggplot2
Tidyverse for Beginners Filter Scatter plot
Learn More R for Data Science Interactively at www.datacamp.com filter() allows you to select a subset of rows in a data frame. Scatter plots allow you to compare two variables within your data. To do this with
ggplot2, you use geom_point()
> iris %>% Select iris data of species
filter(Species=="virginica") "virginica" > iris_small <- iris %>%
> iris %>% Select iris data of species filter(Sepal.Length > 5)
Tidyverse filter(Species=="virginica", "virginica" and sepal length > ggplot(iris_small, aes(x=Petal.Length,
y=Petal.Width)) +
Compare petal
width and length
Sepal.Length > 6) greater than 6.
The tidyverse is a powerful collection of R packages that are actually geom_point()
data tools for transforming and visualizing data. All packages of the
tidyverse share an underlying philosophy and common APIs. Arrange Additional Aesthetics
arrange() sorts the observations in a dataset in ascending or descending order • Color
The core packages are: based on one of its variables. > ggplot(iris_small, aes(x=Petal.Length,
y=Petal.Width,
• ggplot2, which implements the grammar of graphics. You can use it > iris %>% Sort in ascending order of color=Species)) +
to visualize your data. arrange(Sepal.Length) sepal length geom_point()
> iris %>% Sort in descending order of
• dplyr is a grammar of data manipulation. You can use it to solve the arrange(desc(Sepal.Length)) sepal length • Size
most common data manipulation challenges. > ggplot(iris_small, aes(x=Petal.Length,
Combine multiple dplyr verbs in a row with the pipe operator %>%: y=Petal.Width,
color=Species,
• tidyr helps you to create tidy data or data where each variable is in a > iris %>% Filter for species "virginica"
size=Sepal.Length)) +
column, each observation is a row and each value is a cell. filter(Species=="virginica") %>% then arrange in descending
geom_point()
arrange(desc(Sepal.Length)) order of sepal length
• readr is a fast and friendly way to read rectangular data. Faceting
Mutate > ggplot(iris_small, aes(x=Petal.Length,
y=Petal.Width)) +
• purrr enhances R’s functional programming (FP) toolkit by providing a mutate() allows you to update or create new columns of a data frame.
geom_point()+
complete and consistent set of tools for working with functions and facet_wrap(~Species)
vectors. > iris %>% Change Sepal.Length to be
mutate(Sepal.Length=Sepal.Length*10) in millimeters
• tibble is a modern re-imagining of the data frame. > iris %>% Create a new column Line Plots
mutate(SLMm=Sepal.Length*10) called SLMm
> by_year <- gapminder %>%
Combine the verbs filter(), arrange(), and mutate(): group_by(year) %>%
• stringr provides a cohesive set of functions designed to make summarize(medianGdpPerCap=median(gdpPercap))
> iris %>%
working with strings as easy as possible > ggplot(by_year, aes(x=year,
filter(Species=="virginica") %>%
y=medianGdpPerCap))+
mutate(SLMm=Sepal.Length*10) %>% geom_line()+
• forcats provides a suite of useful tools that solve common problems arrange(desc(SLMm)) expand_limits(y=0)
with factors.
Summarize Bar Plots
You can install the complete tidyverse with:
> install.packages("tidyverse") summarize() allows you to turn many observations into a single data point.
> by_species <- iris %>%
> iris %>% Summarize to find the filter(Sepal.Length>6) %>%
Then, load the core tidyverse and make it available in your current R summarize(medianSL=median(Sepal.Length)) median sepal length group_by(Species) %>%
session by running: > iris %>% Filter for virginica then summarize(medianPL=median(Petal.Length))
> library(tidyverse) filter(Species=="virginica") %>% summarize the median > ggplot(by_species, aes(x=Species,
summarize(medianSL=median(Sepal.Length)) sepal length y=medianPL)) +
Note: there are many other tidyverse packages with more specialised usage. They are not geom_col()
loaded automatically with library(tidyverse), so you’ll need to load each one with its own call You can also summarize multiple variables at once:
to library().
> iris %>% Histograms
Useful Functions filter(Species=="virginica") %>%
summarize(medianSL=median(Sepal.Length), > ggplot(iris_small, aes(x=Petal.Length))+
> tidyverse_conflicts() Conflicts between tidyverse and other maxSL=max(Sepal.Length)) geom_histogram()
packages
> tidyverse_deps() List all tidyverse dependencies group_by() allows you to summarize within groups instead of summarizing the
> tidyverse_logo() Get tidyverse logo, using ASCII or unicode entire dataset:
characters
> tidyverse_packages() List all tidyverse packages
> iris %>% Find median and max Box Plots
group_by(Species) %>% sepal length of each
> tidyverse_update() Update tidyverse packages summarize(medianSL=median(Sepal.Length), species > ggplot(iris_small, aes(x=Species,
maxSL=max(Sepal.Length)) y=Sepal.Width))+
Loading in the data > iris %>%
filter(Sepal.Length>6) %>%
Find median and max
petal length of each
geom_boxplot()
> library(datasets) Load the datasets package group_by(Species) %>% species with sepal
> library(gapminder) Load the gapminder package summarize(medianPL=median(Petal.Length), length > 6
> attach(iris) Attach iris data to the R search path maxPL=max(Petal.Length)) DataCamp
Learn R for Data Science Interactively
R For Data Science Cheat Sheet Export xts Objects
> data_xts <- as.xts(matrix)
Missing Values
> na.omit(xts5) Omit NA values in xts5
xts > tmp <- tempfile()
> write.zoo(data_xts,sep=",",file=tmp)
> xts_last <- na.locf(xts2) Fill missing values in xts2 using
Learn R for data science Interactively at www.DataCamp.com last observation
> xts_last <- na.locf(xts2, Fill missing values in xts2 using
fromLast=TRUE) next observation
Replace & Update > na.approx(xts2) Interpolate NAs using linear
> xts2[dates] <- 0 Replace values in xts2 on dates with 0 approximation
xts > xts5["1961"] <- NA Replace dates from 1961 with NA
eXtensible Time Series (xts) is a powerful package that
> xts2["2016-05-02"] <- NA Replace the value at 1 specific index with NA
Arithmetic Operations
provides an extensible time series class, enabling uniform coredata() or as.numeric()
handling of many R time series classes by extending zoo.
Applying Functions
> ep1 <- endpoints(xts4,on="weeks",k=2) Take index values by time > xts3 + as.numeric(xts2) Addition
[1] 0 5 10 > xts3 * as.numeric(xts4)
Load the package as follows: > ep2 <- endpoints(xts5,on="years")
Multiplication
> coredata(xts4) - xts3 Subtraction
> library(xts) [1] 0 12 24 36 48 60 72 84 96 108 120 132 144 > coredata(xts4) / xts3 Division
> period.apply(xts5,INDEX=ep2,FUN=mean) Calculate the yearly mean
xts Objects > xts5_yearly <- split(xts5,f="years") Split xts5 by year Shifting Index Values
> lapply(xts5_yearly,FUN=mean) Create a list of yearly means
xts objects have three main components: > do.call(rbind, Find the last observation in > xts5 - lag(xts5) Period-over-period differences
- coredata: always a matrix for xts objects, while it could also be a lapply(split(xts5,"years"), each year in xts5 > diff(xts5,lag=12,differences=1) Lagged differences
vector for zoo objects function(w) last(w,n="1 month")))
> do.call(rbind, Calculate cumulative annual
- index: vector of any Date, POSIXct, chron, yearmon, lapply(split(xts5,"years"), passengers
Reindexing
yearqtr, or DateTime classes cumsum)) > xts1 + merge(xts2,index(xts1),fill=0) Addition
- xtsAttributes: arbitrary attributes > rollapply(xts5, 3, sd) Apply sd to rolling margins of xts5 e1
2017-05-04 5.231538
2017-05-05 5.829257
Creating xts Objects Selecting, Subsetting & Indexing 2017-05-06
2017-05-07
4.000000
3.000000
2017-05-08 2.000000
> xts1 <- xts(x=1:10, order.by=Sys.Date()-1:10) Select 2017-05-09 1.000000
> data <- rnorm(5)
> xts1 - merge(xts2,index(xts1),fill=na.locf) Subtraction
> dates <- seq(as.Date("2017-05-01"),length=5,by="days") > mar55 <- xts5["1955-03"] Get value for March 1955 e1
> xts2 <- xts(x=data, order.by=dates) 2017-05-04 5.231538
> xts3 <- xts(x=rnorm(10), Subset 2017-05-05
2017-05-06
5.829257
4.829257
order.by=as.POSIXct(Sys.Date()+1:10), 2017-05-07 3.829257
born=as.POSIXct("1899-05-08")) > xts5_1954 <- xts5["1954"] Get all data from 1954 2017-05-08 2.829257
> xts4 <- xts(x=1:10, order.by=Sys.Date()+1:10) > xts5_janmarch <- xts5["1954/1954-03"] Extract data from Jan to March ‘54 2017-05-09 1.829257
> xts5_janmarch <- xts5["/1954-03"] Get all data until March ‘54
Convert To And From xts > xts4[ep1] Subset xts4 using ep1
Merging
> data(AirPassengers) first() and last() > merge(xts2,xts1,join='inner') Inner join of xts2 and xts1
> xts5 <- as.xts(AirPassengers)
> first(xts4,'1 week') Extract first 1 week xts2 xts1
2017-05-05 -0.8382068 10
Import From Files > first(last(xts4,'1 week'),'3 days') Get first 3 days of the last week of data
> merge(xts2,xts1,join='left',fill=0) Left join of xts2 and xts1,
> dat <- read.csv(tmp_file) Indexing fill empty spots with 0
> xts(dat, order.by=as.Date(rownames(dat),"%m/%d/%Y")) xts2 xts1
2017-05-01 1.7482704 0
> dat_zoo <- read.zoo(tmp_file, > xts2[index(xts3)] Extract rows with the index of xts3 2017-05-02 -0.2314678 0
index.column=0, > days <- c("2017-05-03","2017-05-23") 2017-05-03 0.1685517 0
sep=",", > xts3[days] Extract rows using the vector days 2017-05-04 1.1685649 0
2017-05-05 -0.8382068 10
format="%m/%d/%Y") > xts2[as.POSIXct(days,tz="UTC")] Extract rows using days as POSIXct
> dat_zoo <- read.zoo(tmp,sep=",",FUN=as.yearmon) > index <- which(.indexwday(xts1)==0|.indexwday(xts1)==6) Index of weekend days > rbind(xts1, xts4) Combine xts1 and xts4 by
> dat_xts <- as.xts(dat_zoo) > xts1[index] Extract weekend days of xts1 rows
Inspect Your Data
> core_data <- coredata(xts2) Extract core data of objects Periods, Periodicity & Timestamps Other Useful Functions
> index(xts1) Extract index of objects
> periodicity(xts5) Estimate frequency of observations > .index(xts4) Extract raw numeric index of xts4
Class Attributes > to.yearly(xts5) Convert xts5 to yearly OHLC > .indexwday(xts3) Value of week(day), starting on Sunday,
> to.monthly(xts3) Convert xts3 to monthly OHLC in index of xts3
> to.quarterly(xts5) Convert xts5 to quarterly OHLC > .indexhour(xts3) Value of hour in index of xts3
> indexClass(xts2) Get index class > start(xts3) Extract first observation of xts3
> indexClass(convertIndex(xts,'POSIXct')) Replacing index class > to.period(xts5,period="quarters") Convert to quarterly OHLC > end(xts4) Extract last observation of xts4
> indexTZ(xts5) Get the index time zone > str(xts3) Display structure of xts3
> indexFormat(xts5) <- "%Y-%m-%d" Change format of time display > nmonths(xts5) Count the months in xts5 > time(xts1) Extract raw numeric index of xts1
> nquarters(xts5) Count the quarters in xts5 > head(xts2) First part of xts2
Time Zones > nyears(xts5) Count the years in xts5 > tail(xts2) Last part of xts2
> make.index.unique(xts3,eps=1e-4) Make index unique
> tzone(xts1) <- "Asia/Hong_Kong" Change the time zone > make.index.unique(xts3,drop=TRUE) Remove duplicate times
> tzone(xts1) Extract the current time zone > align.time(xts3,n=3600) Round index time to the next n seconds DataCamp
Learn R for Data Science Interactively