
Google Data Engineering Cheatsheet

Compiled by Maverick Lin (http://mavericklin.com)


Last Updated August 4, 2018

What is Data Engineering?

Data engineering enables data-driven decision making by collecting, transforming, and visualizing data. A data engineer designs, builds, maintains, and troubleshoots data processing systems with a particular emphasis on the security, reliability, fault-tolerance, scalability, fidelity, and efficiency of such systems.

A data engineer also analyzes data to gain insight into business outcomes, builds statistical models to support decision-making, and creates machine learning models to automate and simplify key business processes.

Key Points
• Build/maintain data structures and databases
• Design data processing systems
• Analyze data and enable machine learning
• Design for reliability
• Visualize data and advocate policy
• Model business processes for analysis
• Design for security and compliance

Google Cloud Platform (GCP)
GCP is a collection of Google computing resources, which are offered via services. Data engineering services include Compute, Storage, Big Data, and Machine Learning.

The 4 ways to interact with GCP are the console, the command-line interface (CLI), the API, and the mobile app.

The GCP resource hierarchy is organized as follows. All resources (VMs, storage buckets, etc.) are organized into projects. These projects may be organized into folders, which can contain other folders. All folders and projects can be brought together under an organization node. Folders, projects, and organization nodes are where policies can be defined. Policies are inherited downstream and dictate who can access which resources. Every resource must belong to a project, and every project must have a billing account associated with it.

Advantages: Performance (fast solutions), Pricing (sub-hour billing, sustained use discounts, custom machine types), PaaS Solutions, Robust Infrastructure.

Hadoop Overview

Hadoop
Data can no longer fit in memory on one machine (monolithic), so a new way of computing was devised that uses many computers to process the data (distributed). Such a group of machines is called a cluster, and clusters make up server farms. All of these servers have to be coordinated in the following ways: partition data, coordinate computing tasks, handle fault tolerance/recovery, and allocate capacity to processes.

Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is comprised of 3 main components:
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data by partitioning data across many machines
• YARN: framework for job scheduling and cluster resource management (task coordination)
• MapReduce: YARN-based system for parallel processing of large data sets on multiple machines

HDFS
A cluster is comprised of 1 master node; the rest are data nodes. The master node manages the overall file system by storing the directory structure and metadata of the files. The data nodes physically store the data. Large files are broken up and distributed across multiple machines, and the pieces are replicated across 3 machines to provide fault tolerance.

MapReduce
Parallel programming paradigm which allows for processing of huge amounts of data by running processes on multiple machines. Defining a MapReduce job requires two stages: map and reduce.
• Map: operation to be performed in parallel on small portions of the dataset. The output is a key-value pair <K, V>.
• Reduce: operation to combine the results of Map.

YARN - Yet Another Resource Negotiator
Coordinates tasks running on the cluster and assigns new nodes in case of failure. Comprised of 2 subcomponents: the resource manager and the node manager. The resource manager runs on a single master node and schedules tasks across nodes. The node manager runs on all other nodes and manages tasks on the individual node.

Hadoop Ecosystem

An entire ecosystem of tools has emerged around Hadoop, all of which interact with HDFS.
Hive: data warehouse software built on top of Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL-like queries (HiveQL). Hive abstracts away the underlying MapReduce jobs and returns results in the form of tables (not HDFS files).
Pig: high-level scripting language (Pig Latin) that enables writing complex data transformations. It pulls unstructured/incomplete data from sources, cleans it, and places it in a database/data warehouse. Pig performs ETL into the data warehouse, while Hive queries from the data warehouse to perform analysis (GCP: Dataflow).
Spark: framework for writing fast, distributed programs for data processing and analysis. Spark solves similar problems as Hadoop MapReduce but with a fast in-memory approach. It is a unified engine that supports SQL queries, streaming data, machine learning, and graph processing. Can operate separately from Hadoop but integrates well with it. Data is processed using Resilient Distributed Datasets (RDDs), which are immutable, lazily evaluated, and track lineage.
HBase: non-relational, NoSQL, column-oriented database management system that runs on top of HDFS. Well suited for sparse data sets (GCP: BigTable).
Flink/Kafka: stream processing frameworks. Batch processing is for bounded, finite datasets, with periodic updates and delayed processing. Stream processing is for unbounded datasets, with continuous updates and immediate processing. Stream data and stream processing must be decoupled via a message queue. Streaming data can be grouped into windows using tumbling (non-overlapping time), sliding (overlapping time), or session (session gap) windows.
Beam: programming model to define and execute data processing pipelines, including ETL, batch, and stream (continuous) processing. After building the pipeline, it is executed by one of Beam's distributed processing back-ends (Apache Apex, Apache Flink, Apache Spark, or Google Cloud Dataflow). Pipelines are modeled as a Directed Acyclic Graph (DAG).
Oozie: workflow scheduler system to manage Hadoop jobs.
Sqoop: transfer framework for moving large amounts of data into HDFS from relational databases (MySQL).
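To make the map and reduce stages under Hadoop Overview concrete, here is a minimal, framework-free Python sketch of the classic word-count job; the function names are illustrative only and are not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a <key, value> pair ("word", 1) for each word in a chunk of input.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (the framework does this between map and reduce).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```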
IAM

Identity Access Management (IAM)
Access management service to manage the different members of the platform: who has what access for which resource.

Each member has roles and permissions that allow them to perform their duties on the platform. 3 member types: Google account (single person, gmail account), service account (non-person, application), and Google Group (multiple people). Roles are sets of specific permissions for members. You cannot assign permissions to a user directly; you must grant roles.

If you grant a member access at a higher hierarchy level, that member will have access to all levels below that hierarchy level as well. Access cannot be restricted at a lower level. The effective policy is the union of assigned and inherited policies.

Primitive Roles: Owner (full access to resources, manage roles), Editor (edit access to resources, change or add), Viewer (read access to resources)
Predefined Roles: finer-grained access control than primitive roles, predefined by Google Cloud
Custom Roles
Best Practice: use predefined roles when they exist (over primitive). Follow the principle of least privilege.

Stackdriver
GCP's monitoring, logging, and diagnostics solution. Provides insights into the health, performance, and availability of applications.
Main Functions
• Debugger: inspect the state of an app in real time without stopping/slowing it down, e.g. code behavior
• Error Reporting: counts, analyzes, and aggregates crashes in cloud services
• Monitoring: overview of performance, uptime, and health of cloud services (metrics, events, metadata)
• Alerting: create policies to notify you when health and uptime check results exceed a certain limit
• Tracing: tracks how requests propagate through applications; provides near real-time performance results and latency reports for VMs
• Logging: store, search, monitor, and analyze log data and events from GCP

Compute Choices

Google App Engine
Flexible, serverless platform for building highly available applications. Ideal when you want to focus on writing and developing code and do not want to manage servers, clusters, or infrastructure.
Use Cases: web sites, mobile app and gaming backends, RESTful APIs, IoT apps.

Google Kubernetes (Container) Engine
Logical infrastructure powered by Kubernetes, an open-source container orchestration system. Ideal when you want to manage containers in production, increase velocity and operability, and don't have OS dependencies.
Use Cases: containerized workloads, cloud-native distributed systems, hybrid applications.

Google Compute Engine (IaaS)
Virtual Machines (VMs) running in Google's global data centers. Ideal when you need complete control over your infrastructure and direct access to high-performance hardware, or need to make OS-level changes.
Use Cases: any workload requiring a specific OS or OS configuration, currently deployed on-premises software that you want to run in the cloud.

Summary: App Engine is the PaaS option: serverless and ops-free. Compute Engine is the IaaS option: fully controllable down to the OS level. Kubernetes Engine is in the middle: clusters of machines running Kubernetes and hosting containers.

Additional Notes
You can also mix and match multiple compute options.
Preemptible Instances: instances that run at a much lower price but may be terminated at any time, and self-terminate after 24 hours. Ideal for interruptible workloads.
Snapshots: used for backups of disks.
Images: VM OS (Ubuntu, CentOS).

Key Concepts

OLAP vs. OLTP
Online Analytical Processing (OLAP): primary objective is data analysis. An online analysis and data-retrieval process, characterized by a large volume of data and complex queries; uses data warehouses.
Online Transaction Processing (OLTP): primary objective is data processing; manages database modification; characterized by large numbers of short online transactions, simple queries, and traditional DBMSs.

Row vs. Columnar Database
Row Format: stores data by row.
Column Format: stores tables by column rather than by row, which is suitable for analytical query processing and data warehouses.

IaaS, PaaS, SaaS
IaaS: gives you the infrastructure pieces (VMs), but you have to maintain and join together the different infrastructure pieces for your application to work. Most flexible option.
PaaS: gives you all the infrastructure pieces already joined, so you just have to deploy source code on the platform for your application to work. PaaS solutions are managed services/no-ops (highly available/reliable) and serverless/autoscaling (elastic). Less flexible than IaaS.
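To illustrate the OLAP/OLTP and row vs. columnar distinctions above, here is a toy Python sketch (not tied to any particular database engine) of the two storage layouts:

```python
# Row format: each record is stored together; convenient for OLTP point lookups.
rows = [
    {"id": 1, "country": "US", "amount": 30},
    {"id": 2, "country": "DE", "amount": 45},
    {"id": 3, "country": "US", "amount": 12},
]

# Columnar format: each column is stored together; convenient for OLAP scans.
columns = {
    "id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "amount": [30, 45, 12],
}

# An analytical query such as SUM(amount) only needs to touch one column:
print(sum(columns["amount"]))  # 87
# An OLTP-style fetch of a full record is most natural in the row layout:
print(rows[1])                 # {'id': 2, 'country': 'DE', 'amount': 45}
```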
Storage

Persistent Disk
Fully-managed block storage (SSDs) that is suitable for VMs/containers. Good for snapshots of data backups and for sharing read-only data across VMs.

Cloud Storage
Infinitely scalable, fully-managed, and highly reliable object/blob storage. Good for data blobs: images, pictures, videos. Cannot query by content.

To use Cloud Storage, you create buckets to store data, and the location can be specified. Bucket names are globally unique. There are 4 storage classes:
• Multi-Regional: frequent access from anywhere in the world. Use for "hot data"
• Regional: high local performance for a region
• Nearline: storage for data accessed less than once a month (archival)
• Coldline: storage for data accessed less than once a year (archival)

Can use Cloud Storage Transfer Service or Transfer Appliance to move data into Cloud Storage (from AWS, on-premise, or another bucket). Use gsutil if copying files over from on-premise.

CloudSQL and Cloud Spanner - Relational DBs

Cloud SQL
Fully-managed relational database service (supports MySQL/PostgreSQL). Use for relational data: tables, rows and columns, and highly structured data. SQL compatible and can update fields. Not horizontally scalable (small storage: GBs). Good for web frameworks and OLTP workloads (not OLAP).

Cloud Spanner
Google-proprietary offering, more advanced than Cloud SQL. Mission-critical, relational database. Supports horizontal scaling. Combines the benefits of relational and non-relational databases.
• Ideal: relational, structured, and semi-structured data that requires high availability, strong consistency, and transactional reads and writes
• Avoid: data that is not relational or structured, when you want an open-source RDBMS, or when strong consistency and high availability are unnecessary

Cloud Spanner Data Model
A database can contain 1+ tables. Tables look like relational database tables. Data is strongly typed: you must define a schema for each database, and that schema must specify the data types of each column of each table.
Parent-Child Relationships: Can optionally define relationships between tables to physically co-locate their rows for efficient retrieval (data locality: physically storing 1+ rows of a table with a row from another table).

Other Storage Options
- Need SQL for OLTP: CloudSQL/Cloud Spanner.
- Need interactive querying for OLAP: BigQuery.
- Need to store blobs larger than 10 MB: Cloud Storage.
- Need to store structured objects in a document database, with ACID and SQL-like queries: Cloud Datastore.

BigTable

Columnar database ideal for applications that need high throughput, low latencies, and scalability (IoT, user analytics, time-series data, graph data) for non-structured key/value data (each value is < 10 MB). A single value in each row is indexed; this value is known as the row key. Does not support SQL queries. Zonal service. Not good for datasets smaller than 1 TB or for items greater than 10 MB. Ideal for handling large amounts of data (TB/PB) over long periods of time.

Data Model
4-dimensional: row key, column family (table name), column name, timestamp.
• The row key uniquely identifies an entity, and columns contain individual values for each row.
• Similar columns are grouped into column families.
• Each column is identified by a combination of the column family and a column qualifier, which is a unique name within the column family.

Load Balancing
Automatically manages splitting, merging, and rebalancing. The master process balances workload/data volume within clusters. The master splits busier/larger tablets in half and merges less-accessed/smaller tablets together, redistributing them between nodes.
The best write performance is achieved by using row keys that do not follow a predictable order and by grouping related rows so they are adjacent to one another, which results in more efficient reads of multiple rows at the same time.

Security
Security can be managed at the project and instance level. Does not support table-level, row-level, column-level, or cell-level security restrictions.

BigTable Part II

Designing Your Schema
Designing a Bigtable schema is different than designing a schema for an RDBMS. Important considerations:
• Each table has only one index, the row key (4 KB)
• Rows are sorted lexicographically by row key
• All operations are atomic (ACID) at the row level
• Both reads and writes should be distributed evenly
• Try to keep all info for an entity in a single row
• Related entities should be stored in adjacent rows
• Try to store at most 10 MB in a single cell (max 100 MB) and 100 MB in a single row (max 256 MB)
• Supports a max of 1,000 tables in each instance
• Choose row keys that don't follow a predictable order
• Can use up to around 100 column families
• Column Qualifiers: can create as many as you need in each row, but avoid splitting data across more column qualifiers than necessary (16 KB each)
• Tables are sparse: empty columns don't take up any space, so you can create a large number of columns, even if most columns are empty in most rows
• Field promotion (shifting a column into the row key) and salting (prefixing the row key with the remainder of dividing a hash of the timestamp plus row key) are ways to help design row keys

For time-series data, use tall/narrow tables. Denormalize: prefer multiple tall and narrow tables.
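The field promotion and salting techniques above can be sketched in a few lines of Python; this is a hypothetical illustration of key construction only, not the Bigtable client API, and the hash choice is just one possible implementation.

```python
import hashlib

def promoted_key(device_id: str, timestamp_ms: int) -> str:
    # Field promotion: move a high-cardinality field (device_id) into the row key,
    # so reads for one device are contiguous and writes from many devices spread
    # across the key space instead of hotspotting on a purely time-based key.
    return f"{device_id}#{timestamp_ms}"

def salted_key(device_id: str, timestamp_ms: int, buckets: int = 8) -> str:
    # Salting as described above: the remainder of dividing a hash of the
    # timestamp plus row key by the bucket count becomes a prefix, spreading
    # sequential writes over `buckets` ranges; readers fan out over all prefixes.
    base = f"{device_id}#{timestamp_ms}"
    digest = hashlib.md5((str(timestamp_ms) + base).encode()).hexdigest()
    salt = int(digest, 16) % buckets
    return f"{salt}#{base}"

print(promoted_key("sensor-42", 1533340800000))  # sensor-42#1533340800000
print(salted_key("sensor-42", 1533340800000))    # e.g. 3#sensor-42#1533340800000
```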
BigQuery

Scalable, fully-managed data warehouse with extremely fast SQL queries. Allows querying of massive volumes of data at fast speeds. Good for OLAP workloads (petabyte-scale), Big Data exploration and processing, and reporting via Business Intelligence (BI) tools. Supports SQL querying for non-relational data. Relatively cheap to store, but costly for querying/processing. Good for analyzing historical data.

Data Model
Data tables are organized into units called datasets, which are sets of tables and views. A table must belong to a dataset and a dataset must belong to a project. Tables contain records with rows and columns (fields). You can load data into BigQuery via two options: batch loading (free) and streaming (costly).

Security
BigQuery uses IAM to manage access to resources. The three types of resources in BigQuery are organizations, projects, and datasets. Security can be applied at the project and dataset level, but not at the table or view level.

Views
A view is a virtual table defined by a SQL query. When you create a view, you query it in the same way you query a table. Authorized views allow you to share query results with particular users/groups without giving them access to the underlying data. When a user queries the view, the query results contain data only from the tables and fields specified in the query that defines the view.

Billing
Billing is based on storage (amount of data stored), querying (amount of data/number of bytes processed by the query), and streaming inserts. Storage options are active and long-term (modified or not within the past 90 days). Query options are on-demand and flat-rate.
Query costs are based on how much data you read/process, so if you only read a section of a table (partition), your costs will be reduced. Any charges incurred are billed to the attached billing account. Exporting/importing/copying data is free.

BigQuery Part II

Partitioned Tables
Special tables that are divided into partitions based on a column or partition key. Data is stored in different directories and specific queries will only run on slices of the data, which improves query performance and reduces costs. Note that the partitions will not be of the same size; BigQuery handles this automatically.
Each partitioned table can have up to 2,500 partitions (2,500 days, or a few years). The daily limit is 2,000 partition updates per table, per day. The rate limit is 50 partition updates every 10 seconds.

Two types of partitioned tables:
• Ingestion Time: tables partitioned based on the data's ingestion (load) date or arrival date. Each partitioned table has a pseudocolumn _PARTITIONTIME, i.e. the time the data was loaded into the table. Pseudocolumns are reserved for the table and cannot be used by the user.
• Partitioned Tables: tables that are partitioned based on a TIMESTAMP or DATE column.

Windowing: window functions increase the efficiency and reduce the complexity of queries that analyze partitions (windows) of a dataset by providing complex operations without the need for many intermediate calculations. They reduce the need for intermediate tables to store temporary data.

Bucketing
Like partitioning, but each split/partition should be the same size and is based on the hash function of a column. Each bucket is a separate file, which makes for more efficient sampling and joining of data.

Querying
After loading data into BigQuery, you can query it using Standard SQL (preferred) or Legacy SQL (old). Query jobs are actions executed asynchronously to load, export, query, or copy data. Results can be saved to permanent (stored) or temporary (cached) tables. 2 types of queries:
• Interactive: query is executed immediately; counts toward daily/concurrent usage (default)
• Batch: batches of queries are queued and each query starts when idle resources are available; only counts toward the daily limit, and switches to interactive if idle for 24 hours

Wildcard Tables
Used if you want to union all similar tables with similar names: '*' (e.g. project.dataset.Table*)

Optimizing BigQuery

Controlling Costs
- Avoid SELECT * (full scan); select only the columns needed (SELECT * EXCEPT)
- Sample data using preview options for free
- Preview queries to estimate costs (dry run)
- Use max bytes billed to limit query costs
- Don't use a LIMIT clause to limit costs (still a full scan)
- Monitor costs using dashboards and audit logs
- Partition data by date
- Break query results into stages
- Use default table expiration to delete unneeded data
- Use streaming inserts wisely
- Set a hard limit on bytes processed per day

Query Performance
Generally, queries that do less work perform better.

Input Data/Data Sources
- Avoid SELECT *
- Prune partitioned queries (for a time-partitioned table, use the _PARTITIONTIME pseudocolumn to filter partitions)
- Denormalize data (use nested and repeated fields)
- Use external data sources appropriately
- Avoid excessive wildcard tables

SQL Anti-Patterns
- Avoid self-joins; use window functions instead (perform calculations across many table rows related to the current row)
- Partition/Skew: avoid unequally sized partitions, or values that occur far more often than any other value
- Cross-Join: avoid joins that generate more outputs than inputs (pre-aggregate data or use a window function)
- Update/Insert Single Row/Column: avoid point-specific DML; batch updates and inserts instead

Managing Query Outputs
- Avoid repeated joins and reusing the same subqueries
- Writing large result sets has performance/cost impacts; use filters or a LIMIT clause. There is a 128 MB limit for cached results
- Use a LIMIT clause for large sorts (otherwise: Resources Exceeded)

Optimizing Query Computation
- Avoid repeatedly transforming data via SQL queries
- Avoid JavaScript user-defined functions
- Use approximate aggregation functions (approx count)
- Order query operations to maximize performance: use ORDER BY only in the outermost query, and push complex operations to the end of the query
- For queries that join data from multiple tables, optimize join patterns: start with the largest table
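As a hedged sketch of the cost-control tips above (select only needed columns, prune partitions, cap bytes billed), here is what a query might look like with the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

query = """
    SELECT user_id, event_name                    -- avoid SELECT * (full scan)
    FROM `my_project.my_dataset.events`            -- hypothetical table
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2018-08-01')
                             AND TIMESTAMP('2018-08-07')  -- prune partitions
"""

job_config = bigquery.QueryJobConfig()
job_config.maximum_bytes_billed = 10**9  # hard cap on bytes processed by this query

for row in client.query(query, job_config=job_config):
    print(row.user_id, row.event_name)
```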
DataStore

NoSQL document database that automatically handles sharding and replication (highly available, scalable and durable). Supports ACID transactions and SQL-like queries. Query execution time depends on the size of the returned result, not the size of the dataset. Ideal for "needle in a haystack" operations and applications that rely on highly available structured data at scale.

Data Model
Data objects in Datastore are known as entities. An entity has one or more named properties, each of which can have one or more values. Each entity has a key that uniquely identifies it. You can fetch an individual entity using the entity's key, or query one or more entities based on the entities' keys or property values.

Ideal for highly structured data at scale: product catalogs, customer experiences based on users' past activities/preferences, game states. Don't use it if you need extremely low latency or analytics (complex joins, etc.).

DataProc

Fully-managed cloud service for running Spark and Hadoop clusters. Provides access to a Hadoop cluster on GCP and Hadoop-ecosystem tools (Pig, Hive, and Spark). Can be used to implement an ETL warehouse solution.

Preferred if migrating existing on-premise Hadoop or Spark infrastructure to GCP without redevelopment effort. Dataflow is preferred for new development.

DataFlow

Managed service for developing and executing data processing patterns (ETL), based on Apache Beam, for streaming and batch data. Preferred over Hadoop or Spark when building new infrastructure. Usually sits between front-end and back-end storage solutions.

Concepts
Pipeline: encapsulates a series of computations that accepts input data from external sources, transforms the data to provide some useful intelligence, and produces output.
PCollections: abstraction that represents a potentially distributed, multi-element data set that acts as the pipeline's data. PCollection objects represent input, intermediate, and output data. The edges of the pipeline.
Transforms: operations in the pipeline. A transform takes one or more PCollections as input, performs an operation that you specify on each element in that collection, and produces a new output PCollection. Uses the "what/where/when/how" model. The nodes of the pipeline. Composite transforms combine multiple transforms: combining, mapping, shuffling, reducing, or statistical analysis.
Pipeline I/O: the source/sink, where the data flows in and out. Supports read and write transforms for a number of common data storage types, as well as custom ones.

Windowing
Windowing a PCollection divides the elements into windows based on the associated event time for each element. Especially useful for PCollections of unbounded size, since it allows operating on sub-groups (mini-batches).

Triggers
Allows specifying a trigger to control when (in processing time) results for the given window can be produced. If unspecified, the default behavior is to trigger first when the watermark passes the end of the window, and then trigger again every time there is late-arriving data.

Pub/Sub

Asynchronous messaging service that decouples senders and receivers. Allows for secure and highly available communication between independently written applications. A publisher app creates and sends messages to a topic. Subscriber applications create a subscription to a topic to receive messages from it. Communication can be one-to-many (fan-out), many-to-one (fan-in), or many-to-many. Guarantees at-least-once delivery before deletion from the queue.

Scenarios
• Balancing workloads in network clusters (a queue can efficiently distribute tasks)
• Implementing asynchronous workflows
• Data streaming from various processes or devices
• Reliability improvement (in case of zone failure)
• Distributing event notifications
• Refreshing distributed caches
• Logging to multiple systems

Benefits/Features
Unified messaging, global presence, push- and pull-style subscriptions, replicated storage and guaranteed at-least-once message delivery, encryption of data at rest/in transit, easy-to-use REST/JSON API.

Data Model
Topic, Subscription, Message (combination of data and attributes that a publisher sends to a topic and that is eventually delivered to subscribers), Message Attribute (key-value pair that a publisher can define for a message).

Message Flow
• A publisher creates a topic in the Cloud Pub/Sub service and sends messages to the topic.
• Messages are persisted in a message store until they are delivered and acknowledged by subscribers.
• The Pub/Sub service forwards messages from a topic to all of its subscriptions, individually. Each subscription receives messages either by push or by pull.
• The subscriber receives pending messages from its subscription and acknowledges each message.
• When a message is acknowledged by the subscriber, it is removed from the subscription's message queue.
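As a hedged sketch of the publish side of the message flow above, using the google-cloud-pubsub Python client; the project and topic names are placeholders and the topic is assumed to already exist.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "shipment-events")  # placeholder names

# Message data is raw bytes; extra keyword arguments become message attributes.
future = publisher.publish(topic_path, b"parcel 123 scanned", location="warehouse-7")
print("Published message id:", future.result())  # blocks until the service acknowledges
```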
ML Engine

Managed infrastructure of GCP with the power and flexibility of TensorFlow. Can use it to train ML models at scale and to host trained models to make predictions about new data in the cloud. Supported frameworks include TensorFlow, scikit-learn and XGBoost.

ML Workflow
Evaluate Problem: What is the problem, and is ML the best approach? How will you measure the model's success?
Choosing Development Environment: Supports Python 2 and 3, and supports TensorFlow, scikit-learn, and XGBoost as frameworks.
Data Preparation and Exploration: Involves gathering, cleaning, transforming, exploring, splitting, and preprocessing data. Also includes feature engineering.
Model Training/Testing: Provide access to train/test data and train the models in batches. Evaluate progress/results and adjust the model as needed. Export/save the trained model (250 MB or smaller to deploy in ML Engine).
Hyperparameter Tuning: hyperparameters are variables that govern the training process itself and are not related to the training data. They are usually constant during training.
Prediction: host your trained ML models in the cloud and use the Cloud ML prediction service to infer target values for new data.

ML APIs
Speech-to-Text: speech-to-text conversion
Text-to-Speech: text-to-speech conversion
Translation: dynamically translate between languages
Vision: derive insight (objects/text) from images
Natural Language: extract information (sentiment, intent, entity, and syntax) about text: people, places, ...
Video Intelligence: extract metadata from videos

Cloud Datalab
Interactive tool (run on an instance) to explore, analyze, transform, and visualize data and build machine learning models. Built on Jupyter. Datalab is free but may incur costs based on usage of other services.

Data Studio
Turns your data into informative dashboards and reports. Updates to the data will automatically update the dashboard. The query cache remembers the queries (requests for data) issued by the components in a report (lightning bolt); turn it off for data that changes frequently, when you want to prioritize freshness over performance, or when using a data source that incurs usage costs (e.g. BigQuery). The prefetch cache predicts data that could be requested by analyzing the dimensions, metrics, filters, and date range properties and controls on the report.

ML Concepts/Terminology

Features: input data used by the ML model
Feature Engineering: transforming input features to be more useful for the models, e.g. mapping categories to buckets, normalizing between -1 and 1, removing nulls
Train/Eval/Test: training data is used to optimize the model, evaluation data is used to assess the model on new data during training, and test data is used to provide the final result
Classification/Regression: regression is predicting a number (e.g. housing price); classification is predicting from a set of categories (e.g. predicting red/blue/green)
Linear Regression: predicts an output by multiplying and summing input features with weights and biases
Logistic Regression: similar to linear regression but predicts a probability
Neural Network: composed of neurons (simple building blocks that actually "learn"); contains activation functions that make it possible to predict non-linear outputs
Activation Functions: mathematical functions that introduce non-linearity to a network, e.g. RELU, tanh
Sigmoid Function: function that maps very negative numbers to a number very close to 0, huge numbers to a number close to 1, and 0 to 0.5. Useful for predicting probabilities
Gradient Descent/Backpropagation: fundamental loss-optimizer algorithms, on which the other optimizers are usually based. Backpropagation is similar to gradient descent but for neural nets
Optimizer: operation that changes the weights and biases to reduce loss, e.g. Adagrad or Adam
Weights / Biases: weights are values that the input features are multiplied by to predict an output value. Biases are the value of the output given a weight of 0
Converge: an algorithm that converges will eventually reach an optimal answer, even if very slowly. An algorithm that doesn't converge may never reach an optimal answer
Learning Rate: rate at which optimizers change weights and biases. A high learning rate generally trains faster but risks not converging, whereas a lower rate trains more slowly
Overfitting: model performs great on the input data but poorly on the test data (combat with dropout, early stopping, or reducing the number of nodes or layers)
Bias/Variance: how much the output is determined by the features. More variance often means overfitting; more bias can mean a bad model
Regularization: variety of approaches to reduce overfitting, including adding the weights to the loss function and randomly dropping layers (dropout)
Ensemble Learning: training multiple models with different parameters to solve the same problem
Embeddings: mapping from discrete objects, such as words, to vectors of real numbers. Useful because classifiers/neural networks work well on vectors of real numbers

TensorFlow

TensorFlow is an open-source software library for numerical computation using data flow graphs. Everything in TF is a graph, where nodes represent operations on data and edges represent the data. Phase 1 of TF is building up a computation graph and phase 2 is executing it. It is also distributed, meaning it can run on either a cluster of machines or just a single machine.

Tensors
In a graph, tensors are the edges: multidimensional data arrays that flow through the graph. The central unit of data in TF; a tensor consists of a set of primitive values shaped into an array of any number of dimensions.
A tensor is characterized by its rank (number of dimensions in the tensor), shape (number of dimensions and size of each dimension), and data type (data type of each element in the tensor).

Placeholders and Variables
Variables: best way to represent shared, persistent state manipulated by your program. These are the parameters of the ML model, altered/trained during the training process. Training variables.
Placeholders: way to specify inputs into a graph; they hold the place for a Tensor that will be fed at runtime. They are assigned once and do not change afterward. Input nodes.

Popular Architectures
Linear Classifier: takes input features and combines them with weights and biases to predict an output value
DNNClassifier: deep neural net; contains intermediate layers of nodes that represent "hidden features" and activation functions to represent non-linearity
ConvNets: convolutional neural nets, popular for image classification
Transfer Learning: use existing trained models as starting points and add additional layers for the specific use case. The idea is that highly trained existing models know general features that serve as a good starting point for training a small network on specific examples
RNN: recurrent neural nets, designed for handling a sequence of inputs that have "memory" of the sequence. LSTMs are a fancier version of RNNs, popular for NLP
GAN: generative adversarial network; one model creates fake examples, and another model is served both fake examples and real examples and is asked to distinguish between them
Wide and Deep: combines linear classifiers with deep neural net classifiers; the "wide" linear parts represent memorizing specific examples and the "deep" parts represent understanding high-level features
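A minimal sketch of the two-phase model and the placeholder/variable distinction described above, using the TensorFlow 1.x API that this cheatsheet (last updated 2018) assumes:

```python
import tensorflow as tf

# Phase 1: build the computation graph.
x = tf.placeholder(tf.float32, shape=[None], name="x")  # input fed at runtime
w = tf.Variable(2.0, name="weight")                      # trainable parameter
b = tf.Variable(0.5, name="bias")
y = w * x + b                                            # a tiny linear model

# Phase 2: execute the graph in a session.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))   # [2.5 4.5 6.5]
```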
Case Study: Flowlogistic I

Company Overview
Flowlogistic is a top logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.

Company Background
The company started as a regional trucking company, and then expanded into other logistics markets. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.

Solution Concept
Flowlogistic wants to implement two concepts in the cloud:
• Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads.
• Perform analytics on all their orders and shipment logs (structured and unstructured data) to determine how best to deploy resources, which customers to target, and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.

Business Requirements
- Build a reliable and reproducible environment with scaled parity of production
- Aggregate data in a centralized Data Lake for analysis
- Perform predictive analytics on future shipments
- Accurately track every shipment worldwide
- Improve business agility and speed of innovation through rapid provisioning of new resources
- Analyze/optimize architecture cloud performance
- Migrate fully to the cloud if all other requirements are met

Technical Requirements
- Handle both streaming and batch data
- Migrate existing Hadoop workloads
- Ensure architecture is scalable and elastic to meet the changing demands of the company
- Use managed services whenever possible
- Encrypt data in flight and at rest
- Connect a VPN between the production data center and cloud environment

Case Study: Flowlogistic II

Existing Technical Environment
The Flowlogistic architecture resides in a single data center:
• Databases:
  – 8 physical servers in 2 clusters
    ∗ SQL Server: inventory, user/static data
  – 3 physical servers
    ∗ Cassandra: metadata, tracking messages
  – 10 Kafka servers: tracking message aggregation and batch insert
• Application servers: customer front end, middleware for order/customs
  – 60 virtual machines across 20 physical servers
    ∗ Tomcat: Java services
    ∗ Nginx: static content
    ∗ Batch servers
• Storage appliances
  – iSCSI for virtual machine (VM) hosts
  – Fibre Channel storage area network (FC SAN): SQL Server storage
  – Network-attached storage (NAS): image storage, logs, backups
• 10 Apache Hadoop / Spark servers
  – Core Data Lake
  – Data analysis workloads
• 20 miscellaneous servers
  – Jenkins, monitoring, bastion hosts, security scanners, billing software

CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth/efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around. We need to organize our information so we can more easily understand where our customers are and what they are shipping.

CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.

CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments/deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.

Flowlogistic Potential Solution
1. Cloud Dataproc handles the existing workloads and produces results as before (using a VPN).
2. At the same time, data is received through the data center, via either stream or batch, and sent to Pub/Sub.
3. Pub/Sub encrypts the data in transit and at rest.
4. Data is fed into Dataflow as either stream or batch data.
5. Dataflow processes the data and sends the cleaned data to BigQuery (again either as stream or batch).
6. Data can then be queried from BigQuery and predictive analysis can begin (using ML Engine, etc.).

Note: All services are fully managed, easily scalable, and can handle streaming/batch data. All technical requirements are met.
Case Study: MJTelco I

Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.

Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network, allowing them to account for the impact of dynamic regional politics on location availability and cost.

Their management and operations teams are situated all around the globe, creating many-to-many relationships between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.

Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
• Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
• Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definitions. MJTelco will also use three separate operating environments (development/test, staging, and production) to meet the needs of running experiments, deploying new features, and serving production customers.

Business Requirements
- Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
- Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
- Provide reliable and timely access to data for analysis from distributed research workers.
- Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.

Case Study: MJTelco II

Technical Requirements
- Ensure secure and efficient transport and storage of telemetry data.
- Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
- Allow analysis and presentation against data tables tracking up to 2 years of data, storing approximately 100m records/day.
- Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems, both in telemetry flows and in production learning cycles.

CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.

CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.

CFO Statement
This project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.

Choosing a Storage Option

Need | Open Source | GCP Solution
Compute, Block Storage | Persistent Disks, SSD | Persistent Disks, SSD
Media, Blob Storage | Filesystem, HDFS | Cloud Storage
SQL Interface on File Data | Hive | BigQuery
Document DB, NoSQL | CouchDB, MongoDB | DataStore
Fast Scanning NoSQL | HBase, Cassandra | BigTable
OLTP | RDBMS (MySQL) | CloudSQL, Cloud Spanner
OLAP | Hive | BigQuery

Cloud Storage: unstructured data (blobs)
CloudSQL: OLTP, SQL, structured and relational data, no need for horizontal scaling
Cloud Spanner: OLTP, SQL, structured and relational data, need for horizontal scaling, between RDBMS/big data
Cloud Datastore: NoSQL, document data, key-value structured but non-relational data (XML, HTML); query performance depends on size of result (not dataset); fast to read, slow to write
BigTable: NoSQL, key-value data, columnar, good for sparse data, sensitive to hot spotting, high throughput and scalability for non-structured key/value data, where each value is typically no larger than 10 MB
Solutions

Data Lifecycle
At each stage of the data lifecycle, GCP offers multiple services to manage your data.
1. Ingest: the first stage is to pull in the raw data, such as streaming data from devices, on-premises batch data, application logs, or mobile-app user events and analytics
2. Store: after the data has been retrieved, it needs to be stored in a format that is durable and can be easily accessed
3. Process and Analyze: the data is transformed from raw form into actionable information
4. Explore and Visualize: convert the results of the analysis into a format that is easy to draw insights from and to share with colleagues and peers

Ingest
There are a number of approaches to collect raw data, based on the data's size, source, and latency.
• Application: data from application events, log files or user events, typically collected in a push model, where the application calls an API to send the data to storage (Stackdriver Logging, Pub/Sub, CloudSQL, Datastore, Bigtable, Spanner)
• Streaming: data consists of a continuous stream of small, asynchronous messages. Common uses include telemetry, collecting data from geographically dispersed devices (IoT), and user events and analytics (Pub/Sub). For streaming data, use Pub/Sub and Dataflow in combination.
• Batch: large amounts of data are stored in a set of files that are transferred to storage in bulk. Common use cases include scientific workloads, backups, and migration (Cloud Storage, Transfer Service, Transfer Appliance)

Storage
Cloud Storage: durable and highly-available object storage for structured and unstructured data.
Cloud SQL: fully managed cloud RDBMS that offers both MySQL and PostgreSQL engines with built-in support for replication, for low-latency, transactional, relational database workloads. Supports RDBMS workloads up to 10 TB (storing financial transactions, user credentials, customer orders).
BigTable: managed, high-performance NoSQL database service designed for terabyte- to petabyte-scale workloads. Suitable for large-scale, high-throughput workloads such as advertising technology or IoT data infrastructure. Does not support multi-row transactions, SQL queries, or joins; consider Cloud SQL or Cloud Datastore instead.
Cloud Spanner: fully managed relational database service for mission-critical OLTP applications. Horizontally scalable and built for strong consistency, high availability, and global scale. Supports schemas, ACID transactions, and SQL queries (use for retail and global supply chains, ad tech, financial services).
BigQuery: stores large quantities of data for query and analysis instead of transactional processing.

Process and Analyze
In order to derive business value and insights from data, you must transform and analyze it. This requires a processing framework that can either analyze the data directly or prepare the data for downstream analysis, as well as tools to analyze and understand the processing results.
• Processing: data from source systems is cleansed, normalized, and processed across multiple machines, and stored in analytical systems
• Analysis: processed data is stored in systems that allow for ad-hoc querying and exploration
• Understanding: based on analytical results, data is used to train and test automated machine-learning models

Processing
• Cloud Dataproc: migrate your existing Hadoop or Spark deployments to a fully-managed service that automates cluster creation, simplifies configuration and management of your cluster, has built-in monitoring and utilization reports, and can be shut down when not in use
• Cloud Dataflow: designed to simplify big data for both streaming and batch workloads; focuses on filtering, aggregating, and transforming your data
• Cloud Dataprep: service for visually exploring, cleaning, and preparing data for analysis. Can transform data of any size stored in CSV, JSON, or relational-table formats

Analyzing and Querying
• BigQuery: query using SQL, with all data encrypted; used for user analysis, device and operational metrics, and business intelligence
• Task-Specific ML: Vision, Speech, Natural Language, Translation, Video Intelligence
• ML Engine: managed platform you can use to run custom machine learning models at scale

Exploration and Visualization
Data exploration and visualization help you better understand the results of the processing and analysis.
• Cloud Datalab: interactive web-based tool that you can use to explore, analyze, and visualize data, built on top of Jupyter notebooks. Runs on a VM, is automatically saved to persistent disks, and can be stored in a GC Source Repo (git repo)
• Data Studio: drag-and-drop report builder that you can use to visualize data in reports and dashboards that can then be shared with others, backed by live data, and updated easily. Data sources can be data files, Google Sheets, Cloud SQL, and BigQuery. Supports a query cache and a prefetch cache: the query cache remembers previous queries, and if the data is not found there, Data Studio goes to the prefetch cache, which predicts data that could be requested (disable both if data changes frequently or if using the data source incurs charges)
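The Ingest and Processing stages above commonly pair Pub/Sub with Dataflow, whose programming model is Apache Beam (see the DataFlow section earlier). Below is a minimal, hedged sketch of a batch Beam word-count pipeline; the bucket paths are placeholders, and running it on Cloud Dataflow would additionally require Dataflow runner options.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # Runs locally by default; pass --runner=DataflowRunner (plus project/region
    # options) to execute the same pipeline on Cloud Dataflow.
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "Read"   >> beam.io.ReadFromText("gs://my-bucket/input.txt")
            | "Split"  >> beam.FlatMap(lambda line: line.split())
            | "Pair"   >> beam.Map(lambda word: (word, 1))
            | "Count"  >> beam.CombinePerKey(sum)
            | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
            | "Write"  >> beam.io.WriteToText("gs://my-bucket/wordcounts")
        )

if __name__ == "__main__":
    run()
```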
VIP Cheatsheet: Statistics
Shervine Amidi (CME 106 - Introduction to Probability and Statistics for Engineers, https://stanford.edu/~shervine)
August 13, 2018

Parameter estimation

Random sample - A random sample is a collection of n random variables $X_1, ..., X_n$ that are independent and identically distributed with X.

Estimator - An estimator $\hat{\theta}$ is a function of the data that is used to infer the value of an unknown parameter $\theta$ in a statistical model.

Bias - The bias of an estimator $\hat{\theta}$ is defined as the difference between the expected value of the distribution of $\hat{\theta}$ and the true value, i.e.:
$$\textrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$
Remark: an estimator is said to be unbiased when we have $E[\hat{\theta}] = \theta$.

Sample mean and variance - The sample mean and the sample variance of a random sample are used to estimate the true mean $\mu$ and the true variance $\sigma^2$ of a distribution, are noted $\overline{X}$ and $s^2$ respectively, and are such that:
$$\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad\textrm{and}\quad s^2 = \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \overline{X})^2$$

Central Limit Theorem - Let us have a random sample $X_1, ..., X_n$ following a given distribution with mean $\mu$ and variance $\sigma^2$, then we have:
$$\overline{X} \underset{n\to+\infty}{\sim} N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

Confidence intervals

Confidence level - A confidence interval $CI_{1-\alpha}$ with confidence level $1-\alpha$ of a true parameter $\theta$ is such that $1-\alpha$ of the time, the true value is contained in the confidence interval:
$$P(\theta \in CI_{1-\alpha}) = 1 - \alpha$$

Confidence interval for the mean - When determining a confidence interval for the mean $\mu$, different test statistics have to be computed depending on which case we are in. The following table sums it up:

Distribution | Sample size | $\sigma^2$ | Statistic | $1-\alpha$ confidence interval
$X_i \sim N(\mu,\sigma)$ | any | known | $\frac{\overline{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$ | $\left[\overline{X} - z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}},\ \overline{X} + z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}\right]$
$X_i \sim N(\mu,\sigma)$ | small | unknown | $\frac{\overline{X}-\mu}{s/\sqrt{n}} \sim t_{n-1}$ | $\left[\overline{X} - t_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}},\ \overline{X} + t_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}\right]$
$X_i \sim$ any | large | known | $\frac{\overline{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$ | $\left[\overline{X} - z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}},\ \overline{X} + z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}\right]$
$X_i \sim$ any | large | unknown | $\frac{\overline{X}-\mu}{s/\sqrt{n}} \sim N(0,1)$ | $\left[\overline{X} - z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}},\ \overline{X} + z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}\right]$
$X_i \sim$ any | small | any | Go home! | Go home!

Confidence interval for the variance - The single-line table below sums up the test statistic to compute when determining the confidence interval for the variance, where $\chi^2_1$ and $\chi^2_2$ denote the lower and upper critical values of $\chi^2_{n-1}$:

Distribution | Sample size | $\mu$ | Statistic | $1-\alpha$ confidence interval
$X_i \sim N(\mu,\sigma)$ | any | any | $\frac{s^2(n-1)}{\sigma^2} \sim \chi^2_{n-1}$ | $\left[\frac{s^2(n-1)}{\chi^2_2},\ \frac{s^2(n-1)}{\chi^2_1}\right]$

Hypothesis testing

Errors - In a hypothesis test, we note $\alpha$ and $\beta$ the type I and type II errors respectively. By noting T the test statistic and R the rejection region, we have:
$$\alpha = P(T \in R\,|\,H_0 \textrm{ true}) \quad\textrm{and}\quad \beta = P(T \notin R\,|\,H_1 \textrm{ true})$$

p-value - In a hypothesis test, the p-value is the probability under the null hypothesis of having a test statistic T at least as extreme as the one that we observed $T_0$. In particular, when T follows a zero-centered symmetric distribution under $H_0$, we have:

Case | Left-sided | Right-sided | Two-sided
p-value | $P(T \leqslant T_0\,|\,H_0 \textrm{ true})$ | $P(T \geqslant T_0\,|\,H_0 \textrm{ true})$ | $P(|T| \geqslant |T_0|\,|\,H_0 \textrm{ true})$

Sign test - The sign test is a non-parametric test used to determine whether the median of a sample is equal to the hypothesized median. By noting V the number of samples falling to the right of the hypothesized median, we have:

Statistic when $np < 5$: $V \underset{H_0}{\sim} B\left(n,\ p = \frac{1}{2}\right)$
Statistic when $np > 5$: $Z = \frac{V - \frac{n}{2}}{\frac{\sqrt{n}}{2}} \underset{H_0}{\sim} N(0,1)$
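A quick worked illustration of the first row of the confidence-interval table above, with made-up numbers (not from the source): suppose $n = 25$, $\overline{X} = 10$, known $\sigma = 2$, and $\alpha = 0.05$ so that $z_{\alpha/2} \approx 1.96$. Then:
$$\left[\overline{X} - z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}},\ \overline{X} + z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}\right] = \left[10 - 1.96\cdot\frac{2}{5},\ 10 + 1.96\cdot\frac{2}{5}\right] = [9.216,\ 10.784]$$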


Testing for the difference in two means - The table below sums up the test statistic to compute when performing a hypothesis test where the null hypothesis is $H_0: \mu_X - \mu_Y = \delta$:

Distribution of $X_i, Y_i$ | $n_X, n_Y$ | $\sigma_X^2, \sigma_Y^2$ | Statistic
any | any | known | $\frac{(\overline{X}-\overline{Y}) - \delta}{\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}} \underset{H_0}{\sim} N(0,1)$
Normal | large | unknown | $\frac{(\overline{X}-\overline{Y}) - \delta}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}} \underset{H_0}{\sim} N(0,1)$
Normal | small | unknown, $\sigma_X = \sigma_Y$ | $\frac{(\overline{X}-\overline{Y}) - \delta}{s\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}} \underset{H_0}{\sim} t_{n_X+n_Y-2}$
Normal, paired ($D_i = X_i - Y_i$) | $n_X = n_Y$ | unknown | $\frac{\overline{D} - \delta}{s_D/\sqrt{n}} \underset{H_0}{\sim} t_{n-1}$

$\chi^2$ goodness of fit test - By noting k the number of bins, n the total number of samples, $p_i$ the probability of success in each bin and $Y_i$ the associated number of samples, we can use the test statistic T defined below to test whether or not there is a good fit. If $np_i > 5$, we have:
$$T = \sum_{i=1}^{k} \frac{(Y_i - np_i)^2}{np_i} \underset{H_0}{\sim} \chi^2_{df} \quad\textrm{with}\quad df = (k-1) - \#(\textrm{estimated parameters})$$

Test for arbitrary trends - Given a sequence, the test for arbitrary trends is a non-parametric test whose aim is to determine whether the data suggest the presence of an increasing trend:
$$H_0: \textrm{no trend} \quad\textrm{versus}\quad H_1: \textrm{there is an increasing trend}$$
If we note x the number of transpositions in the sequence, the p-value is computed as:
$$\textrm{p-value} = P(T \leqslant x)$$

Regression analysis

In the following section, we will note $(x_1, Y_1), ..., (x_n, Y_n)$ a collection of n data points.

Simple linear model - Let X be a deterministic variable and Y a dependent random variable. In the context of a simple linear model, we assume that Y is linked to X via the regression coefficients $\alpha, \beta$ and a random variable $e \sim N(0,\sigma)$, where e is referred to as the error. We estimate $Y, \alpha, \beta$ by $\hat{Y}, A, B$ and have:
$$Y = \alpha + \beta X + e \quad\textrm{and}\quad \hat{Y}_i = A + Bx_i$$

Notations - Given n data points $(x_i, Y_i)$, we define $S_{XY}, S_{XX}$ and $S_{YY}$ as follows:
$$S_{XY} = \sum_{i=1}^{n}(x_i - \overline{x})(Y_i - \overline{Y}) \quad\textrm{and}\quad S_{XX} = \sum_{i=1}^{n}(x_i - \overline{x})^2 \quad\textrm{and}\quad S_{YY} = \sum_{i=1}^{n}(Y_i - \overline{Y})^2$$

Sum of squared errors - By keeping the same notations, we define the sum of squared errors, also known as SSE, as follows:
$$SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}(Y_i - (A + Bx_i))^2 = S_{YY} - BS_{XY}$$

Least-squares estimates - When estimating the coefficients $\alpha, \beta$ with the least-squares method, which is done by minimizing the SSE, we obtain the estimates A, B defined as follows:
$$A = \overline{Y} - \frac{S_{XY}}{S_{XX}}\overline{x} \quad\textrm{and}\quad B = \frac{S_{XY}}{S_{XX}}$$

Key results - When $\sigma$ is unknown, this parameter is estimated by the unbiased estimator $s^2$ defined as follows:
$$s^2 = \frac{S_{YY} - BS_{XY}}{n-2} \quad\textrm{and we have}\quad \frac{s^2(n-2)}{\sigma^2} \sim \chi^2_{n-2}$$
The table below sums up the properties surrounding the least-squares estimates A, B when $\sigma$ is known or not:

Coeff | $\sigma$ | Statistic | $1-\alpha$ confidence interval
$\alpha$ | known | $\frac{A-\alpha}{\sigma\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{S_{XX}}}} \sim N(0,1)$ | $\left[A - z_{\frac{\alpha}{2}}\sigma\sqrt{\tfrac{1}{n} + \tfrac{\overline{x}^2}{S_{XX}}},\ A + z_{\frac{\alpha}{2}}\sigma\sqrt{\tfrac{1}{n} + \tfrac{\overline{x}^2}{S_{XX}}}\right]$
$\alpha$ | unknown | $\frac{A-\alpha}{s\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{S_{XX}}}} \sim t_{n-2}$ | $\left[A - t_{\frac{\alpha}{2}}s\sqrt{\tfrac{1}{n} + \tfrac{\overline{x}^2}{S_{XX}}},\ A + t_{\frac{\alpha}{2}}s\sqrt{\tfrac{1}{n} + \tfrac{\overline{x}^2}{S_{XX}}}\right]$
$\beta$ | known | $\frac{B-\beta}{\sigma/\sqrt{S_{XX}}} \sim N(0,1)$ | $\left[B - z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{S_{XX}}},\ B + z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{S_{XX}}}\right]$
$\beta$ | unknown | $\frac{B-\beta}{s/\sqrt{S_{XX}}} \sim t_{n-2}$ | $\left[B - t_{\frac{\alpha}{2}}\frac{s}{\sqrt{S_{XX}}},\ B + t_{\frac{\alpha}{2}}\frac{s}{\sqrt{S_{XX}}}\right]$

Correlation analysis

Sample correlation coefficient - The correlation coefficient is in practice estimated by the sample correlation coefficient, often noted r or $\hat{\rho}$, which is defined as:
$$r = \hat{\rho} = \frac{S_{XY}}{\sqrt{S_{XX}S_{YY}}} \quad\textrm{with}\quad \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \underset{H_0}{\sim} t_{n-2} \quad\textrm{for}\quad H_0: \rho = 0$$

Correlation properties - By noting $V_1 = V - \frac{z_{\alpha/2}}{\sqrt{n-3}}$ and $V_2 = V + \frac{z_{\alpha/2}}{\sqrt{n-3}}$ with $V = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$, the table below sums up the key results surrounding the correlation coefficient estimate:

Sample size | Standardized statistic | $1-\alpha$ confidence interval for $\rho$
large | $\frac{V - \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right)}{1/\sqrt{n-3}} \sim N(0,1)$ | $\left[\frac{e^{2V_1}-1}{e^{2V_1}+1},\ \frac{e^{2V_2}-1}{e^{2V_2}+1}\right]$
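A small worked example of the least-squares formulas above, with made-up data (not from the source): take the three points $(1,2), (2,3), (3,5)$, so $\overline{x} = 2$ and $\overline{Y} = \tfrac{10}{3}$. Then:
$$S_{XX} = 1 + 0 + 1 = 2,\quad S_{XY} = (-1)\left(-\tfrac{4}{3}\right) + 0 + (1)\left(\tfrac{5}{3}\right) = 3,\quad S_{YY} = \tfrac{16}{9} + \tfrac{1}{9} + \tfrac{25}{9} = \tfrac{14}{3}$$
$$B = \frac{S_{XY}}{S_{XX}} = \frac{3}{2},\quad A = \overline{Y} - B\,\overline{x} = \frac{10}{3} - 3 = \frac{1}{3},\quad SSE = S_{YY} - BS_{XY} = \frac{14}{3} - \frac{9}{2} = \frac{1}{6}$$
So the fitted line is $\hat{Y} = \tfrac{1}{3} + \tfrac{3}{2}x$.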


Data Science Cheatsheet
Compiled by Maverick Lin (http://mavericklin.com)
Last Updated August 13, 2018

What is Data Science?

Multi-disciplinary field that brings together concepts from computer science, statistics/machine learning, and data analysis to understand and extract insights from the ever-increasing amounts of data.

Two paradigms of data research:
1. Hypothesis-Driven: Given a problem, what kind of data do we need to help solve it?
2. Data-Driven: Given some data, what interesting problems can be solved with it?

The heart of data science is to always ask questions. Always be curious about the world.
1. What can we learn from this data?
2. What actions can we take once we find whatever it is we are looking for?

Types of Data
Structured: Data that has predefined structures. e.g. tables, spreadsheets, or relational databases.
Unstructured Data: Data with no predefined structure, comes in any size or form, cannot be easily stored in tables. e.g. blobs of text, images, audio
Quantitative Data: Numerical. e.g. height, weight
Categorical Data: Data that can be labeled or divided into groups. e.g. race, sex, hair color.
Big Data: Massive datasets, or data that contains greater variety arriving in increasing volumes and with ever-higher velocity (3 Vs). Cannot fit in the memory of a single machine.

Data Sources/Formats
Most Common Data Formats: CSV, XML, SQL, JSON, Protocol Buffers
Data Sources: Companies/Proprietary Data, APIs, Government, Academic, Web Scraping/Crawling

Main Types of Problems
Two problems arise repeatedly in data science.
Classification: Assigning something to a discrete set of possibilities. e.g. spam or non-spam, Democrat or Republican, blood type (A, B, AB, O)
Regression: Predicting a numerical value. e.g. someone's income, next year's GDP, stock price

Probability Overview

Probability theory provides a framework for reasoning about the likelihood of events.

Terminology
Experiment: procedure that yields one of a possible set of outcomes e.g. repeatedly tossing a die or coin
Sample Space S: set of possible outcomes of an experiment e.g. if tossing a die, S = {1,2,3,4,5,6}
Event E: set of outcomes of an experiment e.g. the event that a roll is 5, or the event that the sum of 2 rolls is 7
Probability of an Outcome s or P(s): number that satisfies 2 properties:
1. for each outcome s, 0 ≤ P(s) ≤ 1
2. Σ p(s) = 1
Probability of Event E: sum of the probabilities of the outcomes of the experiment: p(E) = Σ_(s⊂E) p(s)
Random Variable V: numerical function on the outcomes of a probability space
Expected Value of Random Variable V: E(V) = Σ_(s⊂S) p(s) · V(s)

Independence, Conditional, Compound
Independent Events: A and B are independent iff: P(A ∩ B) = P(A)P(B), P(A|B) = P(A), P(B|A) = P(B)
Conditional Probability: P(A|B) = P(A,B)/P(B)
Bayes Theorem: P(A|B) = P(B|A)P(A)/P(B)
Joint Probability: P(A,B) = P(B|A)P(A)
Marginal Probability: P(A)

Probability Distributions
Probability Density Function (PDF): gives the probability that a rv takes on the value x: p_X(x) = P(X = x)
Cumulative Density Function (CDF): gives the probability that a random variable is less than or equal to x: F_X(x) = P(X ≤ x)
Note: The PDF and the CDF of a given random variable contain exactly the same information.

Descriptive Statistics

Provides a way of capturing a given data set or sample. There are two main types: centrality and variability measures.

Centrality
Arithmetic Mean: Useful to characterize symmetric distributions without outliers: μ_X = (1/n) Σ x
Geometric Mean: Useful for averaging ratios. Always less than or equal to the arithmetic mean: (a₁a₂···a_n)^(1/n)
Median: Exact middle value among a dataset. Useful for skewed distributions or data with outliers.
Mode: Most frequent element in a dataset.

Variability
Standard Deviation: Measures the squared differences between the individual elements and the mean: σ = √( Σ_(i=1)^N (x_i − x̄)² / (N − 1) )
Variance: V = σ²

Interpreting Variance
Variance is an inherent part of the universe. It is impossible to obtain the same results after repeated observations of the same event due to random noise/error. Variance can be explained away by attributing it to sampling or measurement errors. Other times, the variance is due to the random fluctuations of the universe.

Correlation Analysis
The correlation coefficient r(X,Y) is a statistic that measures the degree to which Y is a function of X and vice versa. Correlation values range from -1 to 1, where 1 means fully correlated, -1 means negatively correlated, and 0 means no correlation.
Pearson Coefficient: Measures the degree of the relationship between linearly related variables: r = Cov(X,Y) / (σ(X)σ(Y))
Spearman Rank Coefficient: Computed on ranks and depicts monotonic relationships
Note: Correlation does not imply causation!
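As a hedged illustration (not part of the original cheatsheet), the centrality/variability measures and the two correlation coefficients above can be computed with numpy/scipy on a small made-up sample:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 10.0])   # skewed toy sample with an outlier
y = np.array([1.2, 1.9, 2.3, 3.1, 3.8, 9.5])

print(x.mean())                       # arithmetic mean
print(stats.gmean(x))                 # geometric mean (<= arithmetic mean)
print(np.median(x))                   # robust to the outlier 10.0
print(x.std(ddof=1), x.var(ddof=1))   # sample standard deviation / variance

print(stats.pearsonr(x, y)[0])        # Pearson: strength of linear relationship
print(stats.spearmanr(x, y)[0])       # Spearman: rank-based, monotonic relationship
```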
Data Cleaning

Data Cleaning is the process of turning raw data into a clean and analyzable data set. "Garbage in, garbage out." Make sure garbage doesn't get put in.

Errors vs. Artifacts
1. Errors: information that is lost during acquisition and can never be recovered e.g. power outage, crashed servers
2. Artifacts: systematic problems that arise from the data cleaning process. These problems can be corrected, but we must first discover them.

Data Compatibility
Data compatibility problems arise when merging datasets. Make sure you are comparing "apples to apples" and not "apples to oranges". Main types of conversions/unifications:
• units (metric vs. imperial)
• numbers (decimals vs. integers)
• names (John Smith vs. Smith, John)
• time/dates (UNIX vs. UTC vs. GMT)
• currency (currency type, inflation-adjusted, dividends)

Data Imputation
Process of dealing with missing values. The proper methods depend on the type of data we are working with. General methods include:
• Drop all records containing missing data
• Heuristic-Based: make a reasonable guess based on knowledge of the underlying domain
• Mean Value: fill in missing data with the mean
• Random Value
• Nearest Neighbor: fill in missing data using similar data points
• Interpolation: use a method like linear regression to predict the value of the missing data

Outlier Detection
Outliers can interfere with analysis and often arise from mistakes during data collection. It makes sense to run a "sanity check".

Miscellaneous
Lowercasing, removing non-alphanumeric characters, repairing, unidecode, removing unknown characters

Note: When cleaning data, always maintain both the raw data and the cleaned version(s). The raw data should be kept intact and preserved for future use. Any type of data cleaning/analysis should be done on a copy of the raw data.

Feature Engineering

Feature engineering is the process of using domain knowledge to create features or input variables that help machine learning algorithms perform better. Done correctly, it can help increase the predictive power of your models. Feature engineering is more of an art than a science. FE is one of the most important steps in creating a good model. As Andrew Ng puts it:

"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."

Continuous Data
Raw Measures: data that hasn't been transformed yet
Rounding: sometimes precision is noise; round to nearest integer, decimal, etc.
Scaling: log, z-score, minmax scale
Imputation: fill in missing values using mean, median, model output, etc.
Binning: transforming numeric features into categorical ones (or binned) e.g. values between 1-10 belong to A, between 10-20 belong to B, etc.
Interactions: interactions between features: e.g. subtraction, addition, multiplication, statistical test
Statistical: log/power transform (helps turn skewed distributions more normal), Box-Cox
Row Statistics: number of NaN's, 0's, negative values, max, min, etc.
Dimensionality Reduction: using PCA, clustering, factor analysis, etc.

Discrete Data
Encoding: since some ML algorithms cannot work on categorical data, we need to turn categorical data into numerical data or vectors
Ordinal Values: convert each distinct feature into a number (e.g. [r,g,b] becomes [1,2,3])
One-Hot Encoding: each of the m features becomes a vector of length m containing only one 1 (e.g. [r,g,b] becomes [[1,0,0],[0,1,0],[0,0,1]])
Feature Hashing Scheme: turns arbitrary features into indices in a vector or matrix
Embeddings: if using words, convert words to vectors (word embeddings)

Statistical Analysis

Process of statistical reasoning: there is an underlying population of possible things we can potentially observe and only a small subset of them are actually sampled (ideally at random). Probability theory describes what properties our sample should have given the properties of the population, but statistical inference allows us to deduce what the full population is like after analyzing the sample.

Sampling From Distributions
Inverse Transform Sampling: Sampling points from a given probability distribution is sometimes necessary to run simulations or to check whether your data fits a particular distribution. The general technique is called inverse transform sampling or Smirnov transform. First draw a random number p between [0,1]. Compute the value x such that the CDF equals p: F_X(x) = p. Use x as the random value drawn from the distribution described by F_X(x).

Monte Carlo Sampling: In higher dimensions, correctly sampling from a given distribution becomes more tricky. Generally want to use Monte Carlo methods, which typically follow these rules: define a domain of possible inputs, generate random inputs from a probability distribution over the domain, perform a deterministic calculation, and analyze the results.
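A minimal sketch of inverse transform sampling as described above, assuming we want to draw from an Exponential(λ) distribution, whose CDF F(x) = 1 − exp(−λx) can be inverted in closed form. The choice of distribution and parameter is just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0

p = rng.uniform(0.0, 1.0, size=100_000)   # step 1: draw p ~ Uniform[0, 1]
x = -np.log(1.0 - p) / lam                # step 2: solve F_X(x) = p for x

print(x.mean())   # should be close to 1/lambda = 0.5
```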
Classic Statistical Distributions

Binomial Distribution (Discrete)
Assume X is distributed Bin(n,p). X is the number of "successes" that we will achieve in n independent trials, where each trial is either a success or failure, each success occurs with the same probability p and each failure occurs with probability q = 1 − p.
PDF: P(X = x) = C(n, x) p^x (1 − p)^(n−x)
EV: μ = np   Variance = npq

Normal/Gaussian Distribution (Continuous)
Assume X is distributed N(μ, σ²). It is a bell-shaped and symmetric distribution. The bulk of the values lie close to the mean and no value is too extreme. Generalization of the binomial distribution as n → ∞.
PDF: P(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²))
EV: μ   Variance: σ²
Implications: 68%-95%-99.7% rule. 68% of probability mass falls within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.

Poisson Distribution (Discrete)
Assume X is distributed Pois(λ). Poisson expresses the probability of a given number of events occurring in a fixed interval of time/space if these events occur independently and with a known constant rate λ.
PDF: P(x) = e^(−λ) λ^x / x!   EV: λ   Variance = λ

Power Law Distributions (Discrete)
Many data distributions have much longer tails than the normal or Poisson distributions. In other words, the change in one quantity varies as a power of another quantity. It helps measure the inequality in the world. e.g. wealth, word frequency and the Pareto Principle (80/20 Rule)
PDF: P(X = x) = c x^(−α), where α is the law's exponent and c is the normalizing constant

Modeling- Overview

Modeling is the process of incorporating information into a tool which can forecast and make predictions. Usually, we are dealing with statistical modeling where we want to analyze relationships between variables. Formally, we want to estimate a function f(X) such that:

Y = f(X) + ε

where X = (X₁, X₂, ...X_p) represents the input variables, Y represents the output variable, and ε represents random error.

Statistical learning is a set of approaches for estimating this f(X).

Why Estimate f(X)?
Prediction: once we have a good estimate f̂(X), we can use it to make predictions on new data. We treat f̂ as a black box, since we only care about the accuracy of the predictions, not why or how it works.
Inference: we want to understand the relationship between X and Y. We can no longer treat f̂ as a black box since we want to understand how Y changes with respect to X = (X₁, X₂, ...X_p).

More About ε
The error term ε is composed of the reducible and irreducible error, which will prevent us from ever obtaining a perfect f̂ estimate.
• Reducible: error that can potentially be reduced by using the most appropriate statistical learning technique to estimate f. The goal is to minimize the reducible error.
• Irreducible: error that cannot be reduced no matter how well we estimate f. Irreducible error is unknown and unmeasurable and will always be an upper bound for ε.

Note: There will always be trade-offs between model flexibility (prediction) and model interpretability (inference). This is just another case of the bias-variance trade-off. Typically, as flexibility increases, interpretability decreases. Much of statistical learning/modeling is finding a way to balance the two.

Modeling- Philosophies

Modeling is the process of incorporating information into a tool which can forecast and make predictions. Designing and validating models is important, as is evaluating the performance of models. Note that the best forecasting model may not be the most accurate one.

Philosophies of Modeling
Occam's Razor: Philosophical principle that the simplest explanation is the best explanation. In modeling, if we are given two models that predict equally well, we should choose the simpler one. Choosing the more complex one can often result in overfitting.
Bias Variance Trade-Off: Inherent part of predictive modeling, where models with lower bias will have higher variance and vice versa. The goal is to achieve low bias and low variance.
• Bias: error from incorrect assumptions made to make the target function easier to learn (high bias → missing relevant relations or underfitting)
• Variance: error from sensitivity to fluctuations in the dataset, or how much the target estimate would differ if different training data was used (high variance → modeling noise or overfitting)
No Free Lunch Theorem: No single machine learning algorithm is better than all the others on all problems. It is common to try multiple models and find one that works best for a particular problem.

Thinking Like Nate Silver
1. Think Probabilistically: Probabilistic forecasts are more meaningful than concrete statements and should be reported as probability distributions (including σ along with the mean prediction μ).
2. Incorporate New Information: Use live models, which continually update using new information. To update, use Bayesian reasoning to calculate how probabilities change in response to new evidence.
3. Look For Consensus Forecast: Use multiple distinct sources of evidence. Some models operate this way, such as boosting and bagging, which use a large number of weak classifiers to produce a strong one.
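As a hedged aside, the distributions above can be evaluated with scipy.stats; the parameter values here are arbitrary and only demonstrate the PDF/PMF and moment formulas.

```python
from scipy import stats

binom = stats.binom(n=10, p=0.3)
print(binom.pmf(3), binom.mean(), binom.var())    # P(X=3), np, npq

norm = stats.norm(loc=0, scale=1)
print(norm.pdf(0), norm.cdf(1) - norm.cdf(-1))    # density at 0, ~68% within 1 sigma

pois = stats.poisson(mu=4)
print(pois.pmf(2), pois.mean(), pois.var())       # P(X=2), lambda, lambda
```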
Modeling- Taxonomy

There are many different types of models. It is important to understand the trade-offs and when to use a certain type of model.

Parametric vs. Nonparametric
• Parametric: models that first make an assumption about the functional form, or shape, of f (e.g. linear), then fit the model. This reduces estimating f to just estimating a set of parameters, but if our assumption is wrong, it will lead to bad results.
• Non-Parametric: models that don't make any assumptions about f, which allows them to fit a wider range of shapes, but may lead to overfitting.

Supervised vs. Unsupervised
• Supervised: models that fit input variables x_i = (x₁, x₂, ...x_n) to a known output variable y_i = (y₁, y₂, ...y_n)
• Unsupervised: models that take in input variables x_i = (x₁, x₂, ...x_n), but do not have an associated output to supervise the training. The goal is to understand relationships between the variables or observations.

Blackbox vs. Descriptive
• Blackbox: models that make decisions, but we do not know what happens "under the hood" e.g. deep learning, neural networks
• Descriptive: models that provide insight into why they make their decisions e.g. linear regression, decision trees

First-Principle vs. Data-Driven
• First-Principle: models based on a prior belief of how the system under investigation works, incorporating domain knowledge (ad-hoc)
• Data-Driven: models based on observed correlations between input and output variables

Deterministic vs. Stochastic
• Deterministic: models that produce a single "prediction" e.g. yes or no, true or false
• Stochastic: models that produce probability distributions over possible events

Flat vs. Hierarchical
• Flat: models that solve problems on a single level, no notion of subproblems
• Hierarchical: models that solve several different nested subproblems

Modeling- Evaluation Metrics

Need to determine how good our model is. The best way to assess models is with out-of-sample predictions (data points your model has never seen).

Classification

            | Predicted Yes         | Predicted No
Actual Yes  | True Positives (TP)   | False Negatives (FN)
Actual No   | False Positives (FP)  | True Negatives (TN)

Accuracy: ratio of correct predictions over total predictions. Misleading when class sizes are substantially different. accuracy = (TP + TN) / (TP + TN + FN + FP)
Precision: how often the classifier is correct when it predicts positive: precision = TP / (TP + FP)
Recall: how often the classifier is correct for all positive instances: recall = TP / (TP + FN)
F-Score: single measurement to describe performance: F = 2 · (precision · recall) / (precision + recall)
ROC Curves: plots true positive rates and false positive rates for various thresholds, i.e. where the model determines if a data point is positive or negative (e.g. if > 0.8, classify as positive). The best possible area under the ROC curve (AUC) is 1, while random is 0.5, the main diagonal line.

Regression
Errors are defined as the difference between a prediction y₀ and the actual result y.
Absolute Error: Δ = y₀ − y
Squared Error: Δ² = (y₀ − y)²
Mean Squared Error: MSE = (1/n) Σ_(i=1)^n (y₀ᵢ − yᵢ)²
Root Mean Squared Error: RMSD = √MSE
Absolute Error Distribution: Plot the absolute error distribution: it should be symmetric, centered around 0, bell-shaped, and contain rare extreme outliers.

Modeling- Evaluation Environment

Evaluation metrics provide us with the tools to estimate errors, but what should be the process to obtain the best estimate? Resampling involves repeatedly drawing samples from a training set and refitting a model to each sample, which provides us with additional information compared to fitting the model once, such as obtaining a better estimate for the test error.

Key Concepts
Training Data: data used to fit your models, or the set used for learning
Validation Data: data used to tune the parameters of a model
Test Data: data used to evaluate how good your model is. Ideally your model should never touch this data until final testing/evaluation.

Cross Validation
Class of methods that estimate test error by holding out a subset of training data from the fitting process.
Validation Set: split data into a training set and a validation set. Train the model on training and estimate the test error using validation. e.g. 80-20 split
Leave-One-Out CV (LOOCV): split data into a training set and a validation set, but the validation set consists of 1 observation. Then repeat n-1 times until all observations have been used as validation. The test error is the average of these n test error estimates.
k-Fold CV: randomly divide data into k groups (folds) of approximately equal size. The first fold is used as validation and the rest as training. Then repeat k times and take the average of the k estimates.

Bootstrapping
Methods that rely on random sampling with replacement. Bootstrapping helps with quantifying the uncertainty associated with a given estimate or model.

Amplifying Small Data Sets
What can we do if we don't have enough data?
• Create Negative Examples: e.g. when classifying presidential candidates, most people would be unqualified, so label most as unqualified
• Synthetic Data: create additional data by adding noise to the real data
Linear Regression

Linear regression is a simple and useful tool for predicting a quantitative response. The relationship between input variables X = (X₁, X₂, ...X_p) and output variable Y takes the form:

Y ≈ β₀ + β₁X₁ + ... + β_pX_p + ε

β₀...β_p are the unknown coefficients (parameters) which we are trying to determine. The best coefficients will lead us to the best "fit", which can be found by minimizing the residual sum of squares (RSS), the sum of the squared differences between the actual ith value and the predicted ith value: RSS = Σ_(i=1)^n e_i², where e_i = y_i − ŷ_i.

How to find the best fit? How to find the best coefficients?
Matrix Form: We can solve the closed-form equation for the coefficient vector w: w = (XᵀX)⁻¹XᵀY. X represents the input data and Y represents the output data. This method is used for smaller matrices, since inverting a matrix is computationally expensive.
Gradient Descent: First-order optimization algorithm. We can find the minimum of a convex function by starting at an arbitrary point and repeatedly taking steps in the downward direction, which can be found by taking the negative direction of the gradient. After several iterations, we will eventually converge to the minimum. In our case, the minimum corresponds to the coefficients with the minimum error, or the best line of fit. The learning rate α determines the size of the steps we take in the downward direction.

Gradient descent algorithm in two dimensions. Repeat until convergence:
1. w₀^(t+1) := w₀^t − α ∂J(w₀, w₁)/∂w₀
2. w₁^(t+1) := w₁^t − α ∂J(w₀, w₁)/∂w₁

For non-convex functions, gradient descent no longer guarantees an optimal solution since there may be local minima. Instead, we should run the algorithm from different starting points and use the best local minimum we find for the solution.
Stochastic Gradient Descent: instead of taking a step after sampling the entire training set, we take a small batch of training data at random to determine our next step. Computationally more efficient and may lead to faster convergence.

Linear Regression II

Improving Linear Regression
Subset/Feature Selection: approach that involves identifying a subset of the p predictors that we believe to be best related to the response. Then we fit the model using the reduced set of variables.
• Best, Forward, and Backward Subset Selection
Shrinkage/Regularization: all variables are used, but the estimated coefficients are shrunken towards zero relative to the least squares estimate. λ represents the tuning parameter: as λ increases, flexibility decreases → decreased variance but increased bias. The tuning parameter is key in determining the sweet spot between under- and over-fitting. In addition, while Ridge will always produce a model with p variables, Lasso can force coefficients to be equal to zero.
• Lasso (L1): min RSS + λ Σ_(j=1)^p |β_j|
• Ridge (L2): min RSS + λ Σ_(j=1)^p β_j²
Dimension Reduction: projecting the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations of the variables. Can use PCA.
Miscellaneous: removing outliers, feature scaling, removing multicollinearity (correlated variables)

Evaluating Model Accuracy
Residual Standard Error (RSE): RSE = √(RSS / (n − 2)). Generally, the smaller the better.
R²: Measure of fit that represents the proportion of variance explained, or the variability in Y that can be explained using X. It takes on a value between 0 and 1. Generally the higher the better. R² = 1 − RSS/TSS, where Total Sum of Squares (TSS) = Σ (y_i − ȳ)²

Evaluating Coefficient Estimates
The Standard Error (SE) of the coefficients can be used to perform hypothesis tests on the coefficients: H₀: No relationship between X and Y, Hₐ: Some relationship exists. A p-value can be obtained and interpreted as follows: a small p-value indicates that a relationship between the predictor (X) and the response (Y) exists. Typical p-value cutoffs are around 5 or 1%.

Logistic Regression

Logistic regression is used for classification, where the response variable is categorical rather than numerical.

The model works by predicting the probability that Y belongs to a particular category by first fitting the data to a linear regression model, which is then passed to the logistic function (below). The logistic function will always produce an S-shaped curve, so regardless of X, we can always obtain a sensible answer (between 0 and 1). If the probability is above a certain predetermined threshold (e.g. P(Yes) > 0.5), then the model will predict Yes.

p(X) = e^(β₀ + β₁X₁ + ... + β_pX_p) / (1 + e^(β₀ + β₁X₁ + ... + β_pX_p))

Maximum Likelihood: The coefficients β₀...β_p are unknown and must be estimated from the training data. We seek estimates for β₀...β_p such that the predicted probability p̂(x_i) of each observation is a number close to one if it is observed in a certain class and close to zero otherwise. This is done by maximizing the likelihood function:

ℓ(β₀, β₁) = Π_(i: y_i = 1) p(x_i) · Π_(i': y_i' = 0) (1 − p(x_i'))

Potential Issues
Imbalanced Classes: an imbalance in classes in the training data leads to poor classifiers. It can result in a lot of false positives and also leads to little training data for the minority class. Solutions include forcing balanced data by removing observations from the larger class, replicating data from the smaller class, or heavily weighting the training examples toward instances of the larger class.
Multi-Class Classification: the more classes you try to predict, the harder it will be for the classifier to be effective. It is possible with logistic regression, but another approach, such as Linear Discriminant Analysis (LDA), may prove better.
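The following is a minimal numpy sketch of batch gradient descent for simple linear regression, matching the update rule above. The data, learning rate, and iteration count are made up for illustration; this is a sketch, not a production implementation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # roughly y = 2x

w0, w1 = 0.0, 0.0                            # intercept, slope
alpha = 0.01                                 # learning rate
n = len(x)

for _ in range(5000):
    err = (w0 + w1 * x) - y                  # residuals for the current fit
    grad_w0 = (2.0 / n) * err.sum()          # dJ/dw0 for J = MSE
    grad_w1 = (2.0 / n) * (err * x).sum()    # dJ/dw1
    w0 -= alpha * grad_w0
    w1 -= alpha * grad_w1

print(w0, w1)   # should approach the least-squares coefficients
```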
Distance/Network Methods

Interpreting examples as points in space provides a way to find natural groupings or clusters among data e.g. which stars are the closest to our sun? Networks can also be built from point sets (vertices) by connecting related points.

Measuring Distances/Similarity Measure
There are several ways of measuring distances between points a and b in d dimensions, with closer distances implying similarity.

Minkowski Distance Metric: d_k(a, b) = ( Σ_(i=1)^d |a_i − b_i|^k )^(1/k)
The parameter k provides a way to trade off between the largest and the total dimensional difference. In other words, larger values of k place more emphasis on large differences between feature values than smaller values. Selecting the right k can significantly impact the meaningfulness of your distance function. The most popular values are 1 and 2.
• Manhattan (k=1): city block distance, or the sum of the absolute differences between two points
• Euclidean (k=2): straight line distance

Weighted Minkowski: d_k(a, b) = ( Σ_(i=1)^d w_i |a_i − b_i|^k )^(1/k); in some scenarios, not all dimensions are equal. Can convey this idea using w_i. Generally not a good idea: one should normalize the data by Z-scores before computing distances.

Cosine Similarity: cos(a, b) = a·b / (|a||b|), calculates the similarity between 2 non-zero vectors, where a·b is the dot product (normalized between 0 and 1); higher values imply more similar vectors.

Kullback-Leibler Divergence: KL(A‖B) = Σ_(i=1)^d a_i log₂(a_i / b_i). KL divergence measures the distance between probability distributions by measuring the uncertainty gained or lost when replacing distribution A with distribution B. However, this is not a metric, but it forms the basis for the Jensen-Shannon Divergence Metric.
Jensen-Shannon: JS(A, B) = ½ KL(A‖M) + ½ KL(M‖B), where M is the average of A and B. The JS function is the right metric for calculating distances between probability distributions.

Nearest Neighbor Classification

Distance functions allow us to identify the points closest to a given target, or the nearest neighbors (NN) to a given point. The advantages of NN include simplicity, interpretability and non-linearity.

k-Nearest Neighbors
Given a positive integer k and a point x₀, the KNN classifier first identifies the k points in the training data most similar to x₀, then estimates the conditional probability of x₀ being in class j as the fraction of the k points whose values belong to j. The optimal value for k can be found using cross validation.

KNN Algorithm
1. Compute distance D(a,b) from point b to all points
2. Select the k closest points and their labels
3. Output the class with the most frequent label among the k points

Optimizing KNN
Comparing a query point a in d dimensions against n training examples has a runtime of O(nd), which can cause lag as the number of points reaches millions or billions. Popular choices to speed up KNN include:
• Voronoi Diagrams: partitioning the plane into regions based on distance to points in a specific subset of the plane
• Grid Indexes: carve up space into d-dimensional boxes or grids and calculate the NN in the same cell as the point
• Locality Sensitive Hashing (LSH): abandons the idea of finding the exact nearest neighbors. Instead, batch up nearby points to quickly find the most appropriate bucket B for our query point. LSH is defined by a hash function h(p) that takes a point/vector as input and produces a number/code as output, such that it is likely that h(a) = h(b) if a and b are close to each other, and h(a) != h(b) if they are far apart.

Clustering

Clustering is the problem of grouping points by similarity using distance metrics, which ideally reflect the similarities you are looking for. Often items come from logical "sources" and clustering is a good way to reveal those origins. Perhaps the first thing to do with any data set. Possible applications include: hypothesis development, modeling over smaller subsets of data, data reduction, outlier detection.

K-Means Clustering
Simple and elegant algorithm to partition a dataset into K distinct, non-overlapping clusters.
1. Choose a K. Randomly assign a number between 1 and K to each observation. These serve as initial cluster assignments.
2. Iterate until cluster assignments stop changing:
   (a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
   (b) Assign each observation to the cluster whose centroid is closest (where closest is defined using the distance metric).
Since the results of the algorithm depend on the initial random assignments, it is a good idea to repeat the algorithm from different random initializations to obtain the best overall results. Can use MSE to determine which cluster assignment is better.

Hierarchical Clustering
Alternative clustering algorithm that does not require us to commit to a particular K. Another advantage is that it results in a nice visualization called a dendrogram. Observations that fuse at the bottom are similar, whereas those at the top are quite different; we draw conclusions based on the location on the vertical rather than the horizontal axis.
1. Begin with n observations and a measure of all the n(n−1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
2. For i = n, n−1, ..., 2:
   (a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram where the fusion should be placed.
   (b) Compute the new pairwise inter-cluster dissimilarities among the i − 1 remaining clusters.
Linkage: Complete (max dissimilarity), Single (min), Average, Centroid (between centroids of clusters A and B)
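Below is a compact numpy sketch that mirrors the two-step K-Means loop described above (random initial assignment, then alternate centroid update and reassignment). It omits edge cases such as empty clusters; in practice one would typically reach for sklearn.cluster.KMeans instead.

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))            # step 1: random assignment
    centroids = None
    for _ in range(iters):
        # step 2a: centroid = mean of the points currently in each cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2b: reassign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):           # stop when assignments settle
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, K=2)
print(centroids)
```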
Machine Learning Part I

Comparing ML Algorithms
Power and Expressibility: ML methods differ in terms of complexity. Linear regression fits linear functions while NN define piecewise-linear separation boundaries. More complex models can provide more accurate models, but at the risk of overfitting.
Interpretability: some models are more transparent and understandable than others (white box vs. black box models)
Ease of Use: some models feature few parameters/decisions (linear regression/NN), while others require more decision making to optimize (SVMs)
Training Speed: models differ in how fast they fit the necessary parameters
Prediction Speed: models differ in how fast they make predictions given a query

Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features.

Problem: Suppose we need to classify vector X = x₁...x_n into m classes, C₁...C_m. We need to compute the probability of each possible class given X, so we can assign X the label of the class with the highest probability. We can calculate a probability using Bayes' Theorem:

P(C_i|X) = P(X|C_i) P(C_i) / P(X)

Where:
1. P(C_i): the prior probability of belonging to class i
2. P(X): normalizing constant, or the probability of seeing the given input vector over all possible input vectors
3. P(X|C_i): the conditional probability of seeing input vector X given we know the class is C_i

The prediction model will formally look like:

C(X) = argmax_(i ∈ classes(t)) P(X|C_i) P(C_i) / P(X)

where C(X) is the prediction returned for input X.

Machine Learning Part II

Decision Trees
Binary branching structure used to classify an arbitrary input vector X. Each node in the tree contains a simple feature comparison against some field (x_i > 42?). The result of each comparison is either true or false, which determines if we should proceed along to the left or right child of the given node. Also sometimes called classification and regression trees (CART).
Advantages: non-linearity, support for categorical variables, easy to interpret, applicable to regression.
Disadvantages: prone to overfitting, unstable (not robust to noise), high variance, low bias

Note: rarely do models just use one decision tree. Instead, we aggregate many decision trees using methods like ensembling, bagging, and boosting.

Ensembles, Bagging, Random Forests, Boosting
Ensemble learning is the strategy of combining many different classifiers/models into one predictive model. It revolves around the idea of voting: a so-called "wisdom of crowds" approach. The most predicted class will be the final prediction.
Bagging: ensemble method that works by taking B bootstrapped subsamples of the training data and constructing B trees, each tree training on a distinct subsample.
Random Forests: builds on bagging by decorrelating the trees. We do everything the same as in bagging, but when we build the trees, every time we consider a split, a random sample of the p predictors is chosen as split candidates, not the full set (typically m ≈ √p). When m = p, then we are just doing bagging.
Boosting: the main idea is to improve our model where it is not performing well by using information from previously constructed classifiers. A slow learner. Has 3 tuning parameters: number of classifiers B, learning parameter λ, interaction depth d (controls the interaction order of the model).

Machine Learning Part III

Support Vector Machines
Work by constructing a hyperplane that separates points between two classes. The hyperplane is determined using the maximal margin hyperplane, which is the hyperplane that is the maximum distance from the training observations. This distance is called the margin. Points that fall on one side of the hyperplane are classified as -1 and the other +1.

Principal Component Analysis (PCA)
Principal components allow us to summarize a set of correlated variables with a smaller set of variables that collectively explain most of the variability in the original set. Essentially, we are "dropping" the least important feature variables.

Principal Component Analysis is the process by which principal components are calculated and then used to analyze and understand the data. PCA is an unsupervised approach and is used for dimensionality reduction, feature extraction, and data visualization. Variables after performing PCA are independent. Scaling variables is also important while performing PCA.
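As a hedged scikit-learn illustration of three of the classifiers discussed in this section (Naive Bayes, Random Forest, SVM) on the Iris dataset; the hyperparameters are library defaults, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (GaussianNB(), RandomForestClassifier(n_estimators=100), SVC()):
    model.fit(X_tr, y_tr)                          # fit on the training split
    print(type(model).__name__, model.score(X_te, y_te))   # held-out accuracy
```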


Machine Learning Part IV

ML Terminology and Concepts

Features: input data/variables used by the ML model
Feature Engineering: transforming input features to be more useful for the models. e.g. mapping categories to buckets, normalizing between -1 and 1, removing nulls
Train/Eval/Test: training is the data used to optimize the model, evaluation is used to assess the model on new data during training, test is used to provide the final result
Classification/Regression: regression is predicting a number (e.g. housing price), classification is predicting from a set of categories (e.g. predicting red/blue/green)
Linear Regression: predicts an output by multiplying and summing input features with weights and biases
Logistic Regression: similar to linear regression but predicts a probability
Overfitting: model performs great on the input data but poorly on the test data (combat with dropout, early stopping, or reducing the # of nodes or layers)
Bias/Variance: how much the output is determined by the features. More variance often can mean overfitting, more bias can mean a bad model
Regularization: variety of approaches to reduce overfitting, including adding the weights to the loss function, randomly dropping layers (dropout)
Ensemble Learning: training multiple models with different parameters to solve the same problem
A/B testing: statistical way of comparing 2+ techniques to determine which technique performs better and also whether the difference is statistically significant
Baseline Model: simple model/heuristic used as a reference point for comparing how well a model is performing
Bias: prejudice or favoritism towards some things, people, or groups over others that can affect collection/sampling and interpretation of data, the design of a system, and how users interact with a system
Dynamic Model: model that is trained online in a continuously updating fashion
Static Model: model that is trained offline
Normalization: process of converting an actual range of values into a standard range of values, typically -1 to +1
Independently and Identically Distributed (i.i.d.): data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on previously drawn values; ideal but rarely found in real life
Hyperparameters: the "knobs" that you tweak during successive runs of training a model
Generalization: refers to a model's ability to make correct predictions on new, previously unseen data as opposed to the data used to train the model
Cross-Entropy: quantifies the difference between two probability distributions

Deep Learning Part I

What is Deep Learning?
Deep learning is a subset of machine learning. One popular DL technique is based on Neural Networks (NN), which loosely mimic the human brain and whose code structures are arranged in layers. Each layer's input is the previous layer's output, which yields progressively higher-level features and defines a hierarchy. A Deep Neural Network is just a NN that has more than 1 hidden layer.

Recall that statistical learning is all about approximating f(X). Neural networks are known as universal approximators, meaning no matter how complex a function is, there exists a NN that can (approximately) do the job. We can increase the approximation (or complexity) by adding more hidden layers and neurons.

Popular Architectures
There are different kinds of NNs that are suitable for certain problems, which depend on the NN's architecture.
Linear Classifier: takes input features and combines them with weights and biases to predict an output value
DNN: deep neural net, contains intermediate layers of nodes that represent "hidden features" and activation functions to represent non-linearity
CNN: convolutional NN, has a combination of convolutional, pooling, and dense layers. Popular for image classification.
Transfer Learning: use existing trained models as starting points and add additional layers for the specific use case. The idea is that highly trained existing models know general features that serve as a good starting point for training a small network on specific examples.
RNN: recurrent NN, designed for handling a sequence of inputs that have "memory" of the sequence. LSTMs are a fancy version of RNNs, popular for NLP.
GAN: generative adversarial network, one model creates fake examples, and another model is served both fake and real examples and is asked to distinguish between them.
Wide and Deep: combines linear classifiers with deep neural net classifiers, "wide" linear parts represent memorizing specific examples and "deep" parts represent understanding high-level features.

Deep Learning Part II

TensorFlow
TensorFlow is an open source software library for numerical computation using data flow graphs. Everything in TF is a graph, where nodes represent operations on data and edges represent the data. Phase 1 of TF is building up a computation graph and phase 2 is executing it. It is also distributed, meaning it can run on either a cluster of machines or just a single machine.
TF is extremely popular/suitable for working with Neural Networks, since the way TF sets up the computational graph pretty much resembles a NN.

Tensors
In a graph, tensors are the edges and are multidimensional data arrays that flow through the graph. The central unit of data in TF, consisting of a set of primitive values shaped into an array of any number of dimensions. A tensor is characterized by its rank (# of dimensions in the tensor), shape (# of dimensions and size of each dimension), and data type (data type of each element in the tensor).

Placeholders and Variables
Variables: best way to represent shared, persistent state manipulated by your program. These are the parameters of the ML model that are altered/trained during the training process. Training variables.
Placeholders: way to specify inputs into a graph that hold the place for a Tensor that will be fed at runtime. They are assigned once and do not change afterwards. Input nodes.
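The following is a minimal sketch of the two-phase workflow described above, assuming the TensorFlow 1.x graph-mode API (tf.placeholder / tf.Session) that this section is written around; the constants and shapes are made up for illustration.

```python
import tensorflow as tf  # assumes TensorFlow 1.x, matching the graph/session API above

# Phase 1: build the computation graph
x = tf.placeholder(tf.float32, shape=[None])    # input node, fed at runtime
W = tf.Variable(0.5, dtype=tf.float32)          # trainable parameter (variable)
b = tf.Variable(0.0, dtype=tf.float32)
y = W * x + b                                   # node: operation on tensors

# Phase 2: execute the graph in a session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))
```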
Deep Learning Part III

Deep Learning Terminology and Concepts

Neuron: node in a NN, typically taking in multiple input values and generating one output value; calculates the output value by applying an activation function (nonlinear transformation) to a weighted sum of input values
Weights: edges in a NN; the goal of training is to determine the optimal weight for each feature; if weight = 0, the corresponding feature does not contribute
Neural Network: composed of neurons (simple building blocks that actually "learn"), contains activation functions that make it possible to predict non-linear outputs
Activation Functions: mathematical functions that introduce non-linearity to a network e.g. RELU, tanh
Sigmoid Function: function that maps very negative numbers to a number very close to 0, huge numbers close to 1, and 0 to 0.5. Useful for predicting probabilities
Gradient Descent/Backpropagation: fundamental loss optimizer algorithms, on which the other optimizers are usually based. Backpropagation is similar to gradient descent but for neural nets
Optimizer: operation that changes the weights and biases to reduce loss e.g. Adagrad or Adam
Weights / Biases: weights are values that the input features are multiplied by to predict an output value. Biases are the value of the output given a weight of 0.
Converge: an algorithm that converges will eventually reach an optimal answer, even if very slowly. An algorithm that doesn't converge may never reach an optimal answer.
Learning Rate: rate at which optimizers change weights and biases. A high learning rate generally trains faster but risks not converging, whereas a lower rate trains slower
Numerical Instability: issues with very large/small values due to the limits of floating point numbers in computers
Embeddings: mapping from discrete objects, such as words, to vectors of real numbers. Useful because classifiers/neural networks work well on vectors of real numbers
Convolutional Layer: series of convolutional operations, each acting on a different slice of the input matrix
Dropout: method for regularization in training NNs, works by removing a random selection of some units in a network layer for a single gradient step
Early Stopping: method for regularization that involves ending model training early
Gradient Descent: technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data
Pooling: reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix. Pooling usually involves taking either the maximum or average value across the pooled area

Big Data- Hadoop Overview

Data can no longer fit in memory on one machine (monolithic), so a new way of computing was devised using a group of computers to process this "big data" (distributed). Such a group is called a cluster, which makes up server farms. All of these servers have to be coordinated in the following ways: partition data, coordinate computing tasks, handle fault tolerance/recovery, and allocate capacity to process.

Hadoop
Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is comprised of 3 main components:
• Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data by partitioning data across many machines
• YARN: framework for job scheduling and cluster resource management (task coordination)
• MapReduce: YARN-based system for parallel processing of large data sets on multiple machines

HDFS
A cluster is comprised of 1 master node; the rest are workers/data nodes. The master node manages the overall file system by storing the directory structure and the metadata of the files. The data nodes physically store the data. Large files are broken up and distributed across multiple machines, which are also replicated across multiple machines to provide fault tolerance.

MapReduce
Parallel programming paradigm which allows for processing of huge amounts of data by running processes on multiple machines. Defining a MapReduce job requires two stages: map and reduce.
• Map: operation to be performed in parallel on small portions of the dataset. The output is a key-value pair <K, V>
• Reduce: operation to combine the results of Map

YARN- Yet Another Resource Negotiator
Coordinates tasks running on the cluster and assigns new nodes in case of failure. Comprised of 2 subcomponents: the resource manager and the node manager. The resource manager runs on a single master node and schedules tasks across nodes. The node manager runs on all other nodes and manages tasks on the individual node.

Big Data- Hadoop Ecosystem

An entire ecosystem of tools has emerged around Hadoop, which are based on interacting with HDFS. Below are some popular ones:

Hive: data warehouse software built on top of Hadoop that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL-like queries (HiveQL). Hive abstracts away the underlying MapReduce jobs and presents data stored in HDFS in the form of tables.
Pig: high level scripting language (Pig Latin) that enables writing complex data transformations. It pulls unstructured/incomplete data from sources, cleans it, and places it in a database/data warehouse. Pig performs ETL into the data warehouse while Hive queries from the data warehouse to perform analysis (GCP: Dataflow).
Spark: framework for writing fast, distributed programs for data processing and analysis. Spark solves similar problems as Hadoop MapReduce but with a fast in-memory approach. It is a unified engine that supports SQL queries, streaming data, machine learning and graph processing. Can operate separately from Hadoop but integrates well with Hadoop. Data is processed using Resilient Distributed Datasets (RDDs), which are immutable, lazily evaluated, and track lineage.
HBase: non-relational, NoSQL, column-oriented database management system that runs on top of HDFS. Well suited for sparse data sets (GCP: BigTable)
Flink/Kafka: stream processing frameworks. Batch processing is for bounded, finite datasets, with periodic updates and delayed processing. Stream processing is for unbounded datasets, with continuous updates and immediate processing. Stream data and stream processing must be decoupled via a message queue. Can group streaming data (windows) using tumbling (non-overlapping time), sliding (overlapping time), or session (session gap) windows.
Beam: programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. After building the pipeline, it is executed by one of Beam's distributed processing back-ends (Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow). Modeled as a Directed Acyclic Graph (DAG).
Oozie: workflow scheduler system to manage Hadoop jobs
Sqoop: transfer framework to move large amounts of data into HDFS from relational databases (MySQL)
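A toy, single-machine Python sketch of the map and reduce stages described above, using a word-count example. Real MapReduce would distribute these steps across the cluster; this only mirrors the shape of the computation.

```python
from itertools import groupby

docs = ["the cat sat", "the cat ran"]

# Map: emit <K, V> pairs (word, 1) for each small portion of the data
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle/sort: group the pairs by key
mapped.sort(key=lambda kv: kv[0])

# Reduce: combine the values for each key
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}
print(counts)   # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```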
SQL Part I

Structured Query Language (SQL) is a declarative language used to access & manipulate data in databases. Usually the database is a Relational Database Management System (RDBMS), which stores data arranged in relational database tables. A table is arranged in columns and rows, where columns represent characteristics of stored data and rows represent actual data entries.

Basic Queries
- filter columns: SELECT col1, col3... FROM table1
- filter the rows: WHERE col4 = 1 AND col5 = 2
- aggregate the data: GROUP BY...
- limit aggregated data: HAVING count(*) > 1
- order the results: ORDER BY col2

Useful Keywords for SELECT
DISTINCT - return unique results
BETWEEN a AND b - limit the range; the values can be numbers, text, or dates
LIKE - pattern search within the column text
IN (a, b, c) - check if the value is contained among those given

Data Modification
- update specific data with the WHERE clause:
UPDATE table1 SET col1 = 1 WHERE col2 = 2
- insert values manually
INSERT INTO table1 (col1,col3) VALUES (val1,val3);
- by using the results of a query
INSERT INTO table1 (col1,col3) SELECT col,col2 FROM table2;

Joins
The JOIN clause is used to combine rows from two or more tables, based on a related column between them (a runnable sqlite3 sketch follows at the end of this page).

Python- Data Structures

Data structures are a way of storing and manipulating data, and each data structure has its own strengths and weaknesses. Combined with algorithms, data structures allow us to efficiently solve problems. It is important to know the main types of data structures that you will need to efficiently solve problems.

Lists: or arrays, ordered sequences of objects, mutable
>>> l = [42, 3.14, "hello","world"]

Tuples: like lists, but immutable
>>> t = (42, 3.14, "hello","world")

Dictionaries: hash tables, key-value pairs, unsorted
>>> d = {"life": 42, "pi": 3.14}

Sets: mutable, unordered sequence of unique elements. frozensets are just immutable sets
>>> s = set([42, 3.14, "hello","world"])

Collections Module
deque: double-ended queue, generalization of stacks and queues; supports append, appendleft, pop, rotate, etc.
>>> s = deque([42, 3.14, "hello","world"])

Counter: dict subclass, unordered collection where elements are stored as keys and counts stored as values
>>> c = Counter('apple')
>>> print(c)
Counter({'p': 2, 'a': 1, 'l': 1, 'e': 1})

heapq Module
Heap Queue: priority queue; heaps are binary trees for which every parent node has a value less than or equal to any of its children (a min-heap in Python's heapq), order is important; supports push, pop, pushpop, heapify, replace functionality
>>> heap = []
>>> for n in data:
...     heappush(heap, n)
>>> heap
[0, 1, 3, 6, 2, 8, 4, 7, 9, 5]

Recommended Resources
• Data Science Design Manual (www.springer.com/us/book/9783319554433)
• Introduction to Statistical Learning (www-bcf.usc.edu/~gareth/ISL/)
• Probability Cheatsheet (www.wzchen.com/probability-cheatsheet/)
• Google's Machine Learning Crash Course (developers.google.com/machine-learning/crash-course/)
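To tie the SQL and Python columns together, here is a hedged sqlite3 sketch of the basic SELECT/JOIN/GROUP BY patterns above. The table and column names (users, orders) are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
cur.executemany("INSERT INTO users (id, name) VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
cur.executemany("INSERT INTO orders (id, user_id, total) VALUES (?, ?, ?)",
                [(1, 1, 9.99), (2, 1, 5.00), (3, 2, 20.00)])

# JOIN on the related column, aggregate, filter aggregated rows, order the results
cur.execute("""
    SELECT u.name, COUNT(*) AS n_orders, SUM(o.total) AS spent
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.name HAVING COUNT(*) > 1
    ORDER BY spent DESC
""")
print(cur.fetchall())   # [('Ada', 2, 14.99)]
```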
GIT CHEAT SHEET
presented by TOWER (fournova) - Version control with Git, made easy

CREATE

Clone an existing repository
$ git clone ssh://user@domain.com/repo.git

Create a new local repository
$ git init

LOCAL CHANGES

Changed files in your working directory
$ git status

Changes to tracked files
$ git diff

Add all current changes to the next commit
$ git add .

Add some changes in <file> to the next commit
$ git add -p <file>

Commit all local changes in tracked files
$ git commit -a

Commit previously staged changes
$ git commit

Change the last commit (Don't amend published commits!)
$ git commit --amend

COMMIT HISTORY

Show all commits, starting with newest
$ git log

Show changes over time for a specific file
$ git log -p <file>

Who changed what and when in <file>
$ git blame <file>

BRANCHES & TAGS

List all existing branches
$ git branch -av

Switch HEAD branch
$ git checkout <branch>

Create a new branch based on your current HEAD
$ git branch <new-branch>

Create a new tracking branch based on a remote branch
$ git checkout --track <remote/branch>

Delete a local branch
$ git branch -d <branch>

Mark the current commit with a tag
$ git tag <tag-name>

UPDATE & PUBLISH

List all currently configured remotes
$ git remote -v

Show information about a remote
$ git remote show <remote>

Add new remote repository, named <remote>
$ git remote add <shortname> <url>

Download all changes from <remote>, but don't integrate into HEAD
$ git fetch <remote>

Download changes and directly merge/integrate into HEAD
$ git pull <remote> <branch>

Publish local changes on a remote
$ git push <remote> <branch>

Delete a branch on the remote
$ git branch -dr <remote/branch>

Publish your tags
$ git push --tags

MERGE & REBASE

Merge <branch> into your current HEAD
$ git merge <branch>

Rebase your current HEAD onto <branch> (Don't rebase published commits!)
$ git rebase <branch>

Abort a rebase
$ git rebase --abort

Continue a rebase after resolving conflicts
$ git rebase --continue

Use your configured merge tool to solve conflicts
$ git mergetool

Use your editor to manually solve conflicts and (after resolving) mark file as resolved
$ git add <resolved-file>
$ git rm <resolved-file>

UNDO

Discard all local changes in your working directory
$ git reset --hard HEAD

Discard local changes in a specific file
$ git checkout HEAD <file>

Revert a commit (by producing a new commit with contrary changes)
$ git revert <commit>

Reset your HEAD pointer to a previous commit
…and discard all changes since then
$ git reset --hard <commit>
…and preserve all changes as unstaged changes
$ git reset <commit>
…and preserve uncommitted local changes
$ git reset --keep <commit>


VERSION CONTROL
BEST PRACTICES

COMMIT RELATED CHANGES
A commit should be a wrapper for related changes. For example, fixing two different bugs should produce two separate commits. Small commits make it easier for other developers to understand the changes and roll them back if something went wrong. With tools like the staging area and the ability to stage only parts of a file, Git makes it easy to create very granular commits.

COMMIT OFTEN
Committing often keeps your commits small and, again, helps you commit only related changes. Moreover, it allows you to share your code more frequently with others. That way it's easier for everyone to integrate changes regularly and avoid having merge conflicts. Having few large commits and sharing them rarely, in contrast, makes it hard to solve conflicts.

DON'T COMMIT HALF-DONE WORK
You should only commit code when it's completed. This doesn't mean you have to complete a whole, large feature before committing. Quite the contrary: split the feature's implementation into logical chunks and remember to commit early and often. But don't commit just to have something in the repository before leaving the office at the end of the day. If you're tempted to commit just because you need a clean working copy (to check out a branch, pull in changes, etc.) consider using Git's «Stash» feature instead.

TEST CODE BEFORE YOU COMMIT
Resist the temptation to commit something that you «think» is completed. Test it thoroughly to make sure it really is completed and has no side effects (as far as one can tell). While committing half-baked things in your local repository only requires you to forgive yourself, having your code tested is even more important when it comes to pushing/sharing your code with others.

WRITE GOOD COMMIT MESSAGES
Begin your message with a short summary of your changes (up to 50 characters as a guideline). Separate it from the following body by including a blank line. The body of your message should provide detailed answers to the following questions:
›› What was the motivation for the change?
›› How does it differ from the previous implementation?
Use the imperative, present tense («change», not «changed» or «changes») to be consistent with generated messages from commands like git merge.

VERSION CONTROL IS NOT A BACKUP SYSTEM
Having your files backed up on a remote server is a nice side effect of having a version control system. But you should not use your VCS like it was a backup system. When doing version control, you should pay attention to committing semantically (see «related changes») - you shouldn't just cram in files.

USE BRANCHES
Branching is one of Git's most powerful features - and this is not by accident: quick and easy branching was a central requirement from day one. Branches are the perfect tool to help you avoid mixing up different lines of development. You should use branches extensively in your development workflows: for new features, bug fixes, ideas…

AGREE ON A WORKFLOW
Git lets you pick from a lot of different workflows: long-running branches, topic branches, merge or rebase, git-flow… Which one you choose depends on a couple of factors: your project, your overall development and deployment workflows and (maybe most importantly) on your and your teammates' personal preferences. However you choose to work, just make sure to agree on a common workflow that everyone follows.

HELP & DOCUMENTATION
Get help on the command line
$ git help <command>

FREE ONLINE RESOURCES
http://www.git-tower.com/learn
http://rogerdudler.github.io/git-guide/
http://www.git-scm.org/

MACHINE LEARNING CHEATSHEET
Summary of Machine Learning Algorithms descriptions, advantages and use cases. Inspired by the very good book and articles of MachineLearningMastery, with added math, and ML Pros & Cons of HackingNote. Design inspired by The Probability Cheatsheet of W. Chen. Written by Rémi Canard.

General

Definition
We want to learn a target function f that maps input variables X to output variable Y, with an error e:
$Y = f(X) + e$

Linear, Nonlinear
Different algorithms make different assumptions about the shape and structure of f, thus the need of testing several methods. Any algorithm can be either:
- Parametric (or Linear): simplify the mapping to a known linear combination form and learn its coefficients.
- Non parametric (or Nonlinear): free to learn any functional form from the training data, while maintaining some ability to generalize.
Linear algorithms are usually simpler, faster and require less data, while Nonlinear algorithms can be more flexible, more powerful and more performant.

Supervised, Unsupervised
Supervised learning methods learn to predict Y from X given that the data is labeled.
Unsupervised learning methods learn to find the inherent structure of the unlabeled data.

Bias-Variance trade-off
In supervised learning, the prediction error e is composed of the bias, the variance and the irreducible part.
Bias refers to simplifying assumptions made to learn the target function easily.
Variance refers to sensitivity of the model to changes in the training data.
The goal of parameterization is to achieve a low bias (underlying pattern not too simplified) and low variance (not sensitive to specificities of the training data) tradeoff.

Underfitting, Overfitting
In statistics, fit refers to how well the target function is approximated.
Underfitting refers to poor inductive learning from training data and poor generalization.
Overfitting refers to learning the training data detail and noise, which leads to poor generalization. It can be limited by using resampling and defining a validation dataset.

Optimization
Almost every machine learning method has an optimization algorithm at its core.

Gradient Descent
Gradient Descent is used to find the coefficients of f that minimize a cost function (for example MSE, SSR).
Procedure:
→ Initialization $\theta = 0$ (coefficients to 0 or random)
→ Calculate cost $J(\theta) = \text{evaluate}(f(\text{coefficients}))$
→ Gradient of cost $\frac{\partial}{\partial \theta_j} J(\theta)$: we know the uphill direction
→ Update coefficients $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$: we go downhill
The cost updating process is repeated until convergence (minimum found).
Batch Gradient Descent does summing/averaging of the cost over all the observations.
Stochastic Gradient Descent applies the parameter update for each observation.
Tips:
- Change learning rate $\alpha$ ("size of jump" at each iteration)
- Plot Cost vs Time to assess learning rate performance
- Rescale the input variables
- Reduce passes through the training set with SGD
- Average over 10 or more updates to observe the learning trend while using SGD

Ordinary Least Squares
OLS is used to find the estimator $\beta$ that minimizes the sum of squared residuals:
$\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 = \lVert y - X\beta \rVert^2$
Using linear algebra, the closed-form solution is $\hat{\beta} = (X^T X)^{-1} X^T y$.

Maximum Likelihood Estimation
MLE is used to find the estimators that maximize the likelihood function:
$\mathcal{L}(\theta \mid x) = \prod_{i=1}^{n} f_\theta(x_i)$, with $f_\theta$ the density function of the data distribution.
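To make the gradient descent procedure and the OLS closed form concrete, here is a minimal NumPy sketch (the synthetic data, learning rate and iteration count are illustrative assumptions, not recommendations):

# Batch gradient descent on the MSE cost for a linear model, checked
# against the closed-form OLS estimator beta_hat = (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

Xb = np.c_[np.ones(n), X]          # prepend a column of 1s for the intercept beta_0

theta = np.zeros(p + 1)            # initialization: coefficients to 0
alpha = 0.1                        # learning rate ("size of jump")
for _ in range(1000):
    residuals = Xb @ theta - y
    gradient = (2 / n) * Xb.T @ residuals   # dJ/dtheta for the MSE cost
    theta -= alpha * gradient               # go downhill

beta_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)   # closed-form OLS

print("gradient descent:", np.round(theta, 3))
print("closed-form OLS :", np.round(beta_hat, 3))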
Linear Algorithms
All linear algorithms assume a linear relationship between the input variables X and the output variable Y.

Linear Regression
Representation:
A LR model representation is a linear equation:
$y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$
$\beta_0$ is usually called the intercept or bias coefficient. The dimension of the hyperplane of the regression is its complexity.
Learning:
Learning a LR means estimating the coefficients from the training data. Common methods include Gradient Descent or Ordinary Least Squares.
Variations:
There are extensions of LR training called regularization methods, which aim to reduce the complexity of the model:
- Lasso Regression: where OLS is modified to also minimize the sum of the absolute coefficients (L1 regularization):
$\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} |\beta_j| = RSS + \lambda \sum_{j=1}^{p} |\beta_j|$
- Ridge Regression: where OLS is modified to also minimize the squared sum of the coefficients (L2 regularization):
$\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2$
where $\lambda \ge 0$ is a tuning parameter to be determined.
Data preparation:
- Transform data for a linear relationship (ex: log transform for an exponential relationship)
- Remove noise such as outliers
- Rescale inputs using standardization or normalization
Advantages:
+ Good regression baseline considering simplicity
+ Lasso/Ridge can be used to avoid overfitting
+ Lasso/Ridge permit feature selection in case of collinearity
Usecase examples:
- Product sales prediction according to prices or promotions
- Call-center waiting-time prediction according to the number of complaints and the number of working agents

Logistic Regression
It is the go-to method for binary classification.
Representation:
Logistic regression is a linear method, but predictions are transformed using the logistic function (or sigmoid). The sigmoid $\phi$ is S-shaped and maps real-valued numbers into (0,1).
The representation is an equation with binary output:
$y = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}}$
which actually models the probability of the default class:
$p(X) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}} = p(Y = 1 \mid X)$
Learning:
Learning the Logistic Regression coefficients is done using maximum-likelihood estimation, to predict values close to 1 for the default class and close to 0 for the other class.
Data preparation:
- Probability transformation to binary for classification
- Remove noise such as outliers
Advantages:
+ Good classification baseline considering simplicity
+ Possibility to change the cutoff for a precision/recall tradeoff
+ Robust to noise/overfitting with L1/L2 regularization
+ Probability output can be used for ranking
Usecase examples:
- Customer scoring with probability of purchase
- Classification of loan defaults according to profile
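A minimal scikit-learn sketch of the linear models above, assuming scikit-learn and NumPy are available (the synthetic data and the regularization strengths are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y_reg = 3.0 + X @ np.array([1.5, 0.0, -2.0, 0.0, 0.7]) + rng.normal(scale=0.2, size=300)
y_clf = (y_reg > y_reg.mean()).astype(int)   # binarize the target for the classification example

ols = LinearRegression().fit(X, y_reg)
ridge = Ridge(alpha=1.0).fit(X, y_reg)       # L2: shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y_reg)       # L1: can zero out coefficients (feature selection)
logit = LogisticRegression().fit(X, y_clf)   # fitted by maximum likelihood

print("OLS  :", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
print("Lasso:", np.round(lasso.coef_, 2))
print("P(Y=1) for first row:", logit.predict_proba(X[:1])[0, 1])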
Linear Discriminant Analysis
For multiclass classification, LDA is the preferred linear technique.
Representation:
The LDA representation consists of statistical properties calculated for each class: the means and the covariance matrix:
$\mu_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i$ and $\sigma^2 = \frac{1}{n - K} \sum_{i=1}^{n} (x_i - \mu_k)^2$
LDA assumes Gaussian data and attributes of same $\sigma^2$.
Predictions are made using Bayes Theorem:
$P(Y = k \mid X = x) = \frac{P(k) \times P(x \mid k)}{\sum_{l=1}^{K} P(l) \times P(x \mid l)}$
to obtain a discriminant function (latent variable) for each class k, estimating $P(x \mid k)$ with a Gaussian distribution:
$D_k(x) = x \times \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(P(k))$
The class with the largest discriminant value is the output class.
Variations:
- Quadratic DA: each class uses its own variance estimate
- Regularized DA: introduces regularization into the variance estimate
Data preparation:
- Review and modify univariate distributions to be Gaussian
- Standardize data to $\mu = 0$, $\sigma = 1$ to have the same variance
- Remove noise such as outliers
Advantages:
+ Can be used for dimensionality reduction by keeping the latent variables as new variables
Usecase example:
- Prediction of customer churn

Nonlinear Algorithms
All nonlinear algorithms are non-parametric and more flexible. They are not sensitive to outliers and do not require any shape of distribution.

Classification and Regression Trees
Also referred to as CART or Decision Trees, this algorithm is the foundation of Random Forest and Boosted Trees.
Representation:
The model representation is a binary tree, where each node is an input variable x with a split point and each leaf contains an output variable y for prediction.
The model actually splits the input space into (hyper) rectangles, and predictions are made according to the area observations fall into.
Learning:
Learning of a CART is done by a greedy approach called recursive binary splitting of the input space:
At each step, the best predictor $X_j$ and the best cutpoint s are selected such that $\{X \mid X_j < s\}$ and $\{X \mid X_j \ge s\}$ minimize the cost.
- For regression the cost is the Sum of Squared Errors: $\sum_{i=1}^{n} (y_i - \hat{y})^2$
- For classification the cost function is the Gini index: $G = \sum_{k} p_k (1 - p_k)$
The Gini index is an indication of how pure the leaves are: if all observations are the same type G = 0 (perfect purity), while a 50-50 split for binary would be G = 0.5 (worst purity).
The most common Stopping Criterion for splitting is a minimum number of training observations per node.
The simplest form of pruning is Reduced Error Pruning: starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected, the change is kept.
Advantages:
+ Easy to interpret and no overfitting with pruning
+ Works for both regression and classification problems
+ Can take any type of variables without modification, and does not require any data preparation
Usecase examples:
- Fraudulent transaction classification
- Predict human resource allocation in companies

Naive Bayes Classifier
Naive Bayes is a classification algorithm interested in selecting the best hypothesis h given data d, assuming there is no interaction between features.
Representation:
The representation is based on Bayes Theorem:
$P(h \mid d) = \frac{P(d \mid h) \times P(h)}{P(d)}$
with the naive hypothesis $P(h \mid d) = P(x_1 \mid h) \times \dots \times P(x_n \mid h)$
The prediction is the Maximum A Posteriori hypothesis:
$MAP(h) = \max(P(h \mid d)) = \max(P(d \mid h) \times P(h))$
The denominator is not kept as it is only for normalization.
Learning:
Training is fast because only probabilities need to be calculated:
$P(h) = \frac{\text{instance count}(h)}{\text{all instances}}$ and $P(x \mid h) = \frac{\text{instance count}(x \wedge h)}{\text{instance count}(h)}$
Variations:
Gaussian Naive Bayes can extend to numerical attributes by assuming a Gaussian distribution.
Instead of $P(x \mid h)$, $\mu(x)$ and $\sigma^2$ are calculated with $P(h)$ during learning:
$\mu(x) = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu(x))^2$
and MAP for prediction is calculated using the Gaussian PDF:
$f(x \mid \mu(x), \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu(x))^2}{2\sigma^2}}$
Data preparation:
- Change numerical inputs to categorical (binning) or near-Gaussian inputs (remove outliers, log & boxcox transform)
- Other distributions can be used instead of Gaussian
- Log-transform of the probabilities can avoid underflow
- Probabilities can be updated as data becomes available
Advantages:
+ Fast because of the simple calculations
+ If the naive assumption works, can converge quicker than other models. Can be used on smaller training data.
+ Good for few-category variables
Usecase examples:
- Article classification using binary word presence
- Email spam detection using a similar technique
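A minimal scikit-learn sketch of a CART decision tree and Gaussian Naive Bayes on the same synthetic data, assuming scikit-learn is available (the dataset and max_depth are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)   # limit depth to curb overfitting
nb = GaussianNB().fit(X_train, y_train)                            # estimates mu and sigma^2 per feature and class

print("CART accuracy       :", tree.score(X_test, y_test))
print("Gaussian NB accuracy:", nb.score(X_test, y_test))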

K-Nearest Neighbors
If you are similar to your neighbors, you are one of them.
Representation:
KNN uses the entire training set, no training is required.
Predictions are made by searching the k most similar instances, according to a distance, and summarizing the output.
For regression the output can be the mean, while for classification the output can be the most common class.
Various distances can be used, for example:
- Euclidean Distance, good for similar types of variables: $d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
- Manhattan Distance, good for different types of variables: $d(a, b) = \sum_{i=1}^{n} |a_i - b_i|$
The best value of k must be found by testing, and the algorithm is sensitive to the Curse of Dimensionality.
Data preparation:
- Rescale inputs using standardization or normalization
- Address missing data for distance calculations
- Dimensionality reduction or feature selection for COD
Advantages:
+ Effective if the training data is large
+ No learning phase
+ Robust to noisy data, no need to filter outliers
Usecase examples:
- Recommending products based on similar customers
- Anomaly detection in customer behavior

Support Vector Machines
SVM is a go-to for high performance with little tuning.
Representation:
In SVM, a hyperplane is selected to separate the points in the input variable space by their class, with the largest margin.
The closest datapoints (defining the margin) are called the support vectors.
But real data cannot be perfectly separated, that is why a parameter C defines the amount of violation of the margin allowed. The lower C, the more sensitive SVM is to training data.
The prediction function is the signed distance of the new input x to the separating hyperplane w:
$f(x) = \langle w, x \rangle + \rho = w^T x + \rho$, with $\rho$ the bias
which gives, for a linear kernel with $x_i$ the support vectors:
$f(x) = \sum_{i=1}^{n} a_i \times (x \times x_i) + \rho$
Learning:
The hyperplane learning is done by transforming the problem using linear algebra, and minimizing:
$\frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i (w \cdot x_i - b)) + \lambda \lVert w \rVert^2$
Variations:
SVM is implemented using various kernels, which define the measure between new data and the support vectors:
- Linear (dot-product): $K(x, x_i) = (x \times x_i)$
- Polynomial: $K(x, x_i) = (1 + (x \times x_i))^d$
- Radial: $K(x, x_i) = e^{-\gamma \lVert x - x_i \rVert^2}$
Data preparation:
- SVM assumes numeric inputs, may require dummy transformation of categorical features
Advantages:
+ Allows nonlinear separation with nonlinear kernels
+ Works well in high dimensional spaces
+ Robust to multicollinearity and overfitting
Usecase examples:
- Face detection from images
- Target Audience Classification from tweets

Ensemble Algorithms
Ensemble methods use multiple, simpler algorithms combined to obtain better performance.

Bagging and Random Forest
Random Forest is part of a bigger type of ensemble methods called Bootstrap Aggregation or Bagging. Bagging can reduce the variance of high-variance models.
It uses the Bootstrap statistical procedure: estimate a quantity from a sample by creating many random subsamples with replacement, and computing the mean of each subsample.
Representation:
For bagged decision trees, the steps would be:
- Create many subsamples of the training dataset
- Train a CART model on each sample
- Given a new dataset, calculate the average prediction
However, combining models works best if the submodels are at most weakly correlated.
Random Forest is a tweaked version of bagged decision trees that reduces tree correlation.
Learning:
During learning, each sub-tree can only access a random sample of features when selecting the split points. The size of the feature sample at each split is a parameter m.
A good default is $\sqrt{p}$ for classification and $\frac{p}{3}$ for regression.
The OOB estimate is the performance of each model on its Out-Of-Bag (not selected) samples. It is a reliable estimate of test error.
Bagged methods can provide feature importance, by calculating and averaging the error function drop for individual variables (depending on the samples where a variable is selected or not).
Advantages:
In addition to the advantages of the CART algorithm:
+ Robust to overfitting and missing variables
+ Can be parallelized for distributed computing
+ Performance as good as SVM but easier to interpret
Usecase examples:
- Predictive machine maintenance
- Optimizing line decision for credit cards
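A minimal scikit-learn sketch contrasting a single CART, a bagged Random Forest and an RBF-kernel SVM on the same synthetic data, assuming scikit-learn is available (n_estimators, C and gamma are illustrative defaults, not tuned values):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

models = {
    "CART         ": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),  # bagging + random feature subsets
    "SVM (RBF)    ": SVC(C=1.0, gamma="scale"),                                 # soft margin controlled by C
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean CV accuracy:", round(scores.mean(), 3))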
Boosting and AdaBoost
AdaBoost was the first successful boosting algorithm developed for binary classification.
Representation:
A boosted classifier is of the form:
$F_T(x) = \sum_{t=1}^{T} f_t(x)$
where each $f_t$ is a weak learner correcting the errors of the previous one.
AdaBoost is commonly used with decision trees with one level (decision stumps).
Predictions are made using the weighted average of the weak classifiers.
Learning:
Each training set instance is initially weighted $w(x_i) = \frac{1}{n}$.
One decision stump is prepared using the weighted samples, and a misclassification rate is calculated:
$\epsilon = \frac{\sum_{i=1}^{n} (w_i \times p_{error,i})}{\sum_{i=1}^{n} w_i}$
which is the weighted sum of the misclassification rates, where $w_i$ is the weight of training instance i and $p_{error,i}$ its prediction error (1 or 0).
A stage value is computed from the misclassification rate:
$stage = \ln\left(\frac{1 - \epsilon}{\epsilon}\right)$
This stage value is used to update the instance weights:
$w = w \times e^{stage \times p_{error}}$
The incorrectly predicted instances are given more weight.
Weak models are added sequentially using the training weights, until no improvement can be made or the number of rounds has been attained.
Data preparation:
- Outliers should be removed for AdaBoost
Advantages:
+ High performance with no tuning (only the number of rounds)

Interesting Resources

Machine Learning Mastery website
> https://machinelearningmastery.com/
Scikit-learn website, for python implementation
> http://scikit-learn.org/
W. Chen probability cheatsheet
> https://github.com/wzchen/probability_cheatsheet
HackingNote, for interesting, condensed insights
> https://www.hackingnote.com/
Seattle Data Guy blog, for business oriented articles
> https://www.theseattledataguy.com/
Explained visually, making hard ideas intuitive
> http://setosa.io/ev/
This Machine Learning Cheatsheet
> https://github.com/remicnrd/ml_cheatsheet
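A minimal scikit-learn sketch of AdaBoost as described above, assuming scikit-learn is available (by default AdaBoostClassifier boosts decision stumps, i.e. one-level trees; n_estimators is an illustrative choice):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learners are added sequentially; each round reweights the
# misclassified instances before fitting the next stump.
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X_train, y_train)
print("AdaBoost test accuracy:", round(boost.score(X_test, y_test), 3))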
Dictionaries store connections between pieces of
List comprehensions information. Each item in a dictionary is a key-value pair.
squares = [x**2 for x in range(1, 11)] A simple dictionary
Slicing a list alien = {'color': 'green', 'points': 5}
finishers = ['sam', 'bob', 'ada', 'bea']
Accessing a value
first_two = finishers[:2]
print("The alien's color is " + alien['color'])
Variables are used to store values. A string is a series of Copying a list
characters, surrounded by single or double quotes. Adding a new key-value pair
copy_of_bikes = bikes[:]
Hello world alien['x_position'] = 0
print("Hello world!") Looping through all key-value pairs
Hello world with a variable Tuples are similar to lists, but the items in a tuple can't be fav_numbers = {'eric': 17, 'ever': 4}
modified. for name, number in fav_numbers.items():
msg = "Hello world!"
print(name + ' loves ' + str(number))
print(msg) Making a tuple
Concatenation (combining strings) dimensions = (1920, 1080)
Looping through all keys
fav_numbers = {'eric': 17, 'ever': 4}
first_name = 'albert'
for name in fav_numbers.keys():
last_name = 'einstein'
print(name + ' loves a number')
full_name = first_name + ' ' + last_name If statements are used to test for particular conditions and
print(full_name) respond appropriately. Looping through all the values
Conditional tests fav_numbers = {'eric': 17, 'ever': 4}
for number in fav_numbers.values():
equals x == 42 print(str(number) + ' is a favorite')
A list stores a series of items in a particular order. You
not equal x != 42
access items using an index, or within a loop.
greater than x > 42
Make a list or equal to x >= 42
less than x < 42 Your programs can prompt the user for input. All input is
bikes = ['trek', 'redline', 'giant'] or equal to x <= 42 stored as a string.
Get the first item in a list Conditional test with lists Prompting for a value
first_bike = bikes[0] 'trek' in bikes name = input("What's your name? ")
Get the last item in a list 'surly' not in bikes print("Hello, " + name + "!")
last_bike = bikes[-1] Assigning boolean values Prompting for numerical input
Looping through a list game_active = True age = input("How old are you? ")
can_edit = False age = int(age)
for bike in bikes:
print(bike) A simple if test
pi = input("What's the value of pi? ")
Adding items to a list if age >= 18: pi = float(pi)
print("You can vote!")
bikes = []
bikes.append('trek') If-elif-else statements
bikes.append('redline') if age < 4:
bikes.append('giant') ticket_price = 0
Making numerical lists elif age < 18:
ticket_price = 10
squares = [] else:
for x in range(1, 11): ticket_price = 15
squares.append(x**2)
A while loop repeats a block of code as long as a certain A class defines the behavior of an object and the kind of Your programs can read from files and write to files. Files
condition is true. information an object can store. The information in a class are opened in read mode ('r') by default, but can also be
is stored in attributes, and functions that belong to a class opened in write mode ('w') and append mode ('a').
A simple while loop are called methods. A child class inherits the attributes and
methods from its parent class. Reading a file and storing its lines
current_value = 1
while current_value <= 5: filename = 'siddhartha.txt'
Creating a dog class
print(current_value) with open(filename) as file_object:
current_value += 1 class Dog(): lines = file_object.readlines()
"""Represent a dog."""
Letting the user choose when to quit for line in lines:
msg = '' def __init__(self, name): print(line)
while msg != 'quit': """Initialize dog object."""
self.name = name Writing to a file
msg = input("What's your message? ")
print(msg) filename = 'journal.txt'
def sit(self): with open(filename, 'w') as file_object:
"""Simulate sitting.""" file_object.write("I love programming.")
print(self.name + " is sitting.")
Functions are named blocks of code, designed to do one Appending to a file
specific job. Information passed to a function is called an my_dog = Dog('Peso')
argument, and information received by a function is called a filename = 'journal.txt'
parameter. with open(filename, 'a') as file_object:
print(my_dog.name + " is a great dog!") file_object.write("\nI love making games.")
A simple function my_dog.sit()

def greet_user(): Inheritance


"""Display a simple greeting.""" class SARDog(Dog): Exceptions help you respond appropriately to errors that
print("Hello!") """Represent a search dog.""" are likely to occur. You place code that might cause an
error in the try block. Code that should run in response to
greet_user() def __init__(self, name): an error goes in the except block. Code that should run only
Passing an argument """Initialize the sardog.""" if the try block was successful goes in the else block.
super().__init__(name)
def greet_user(username): Catching an exception
"""Display a personalized greeting.""" def search(self): prompt = "How many tickets do you need? "
print("Hello, " + username + "!") """Simulate searching.""" num_tickets = input(prompt)
print(self.name + " is searching.")
greet_user('jesse') try:
my_dog = SARDog('Willie') num_tickets = int(num_tickets)
Default values for parameters
except ValueError:
def make_pizza(topping='bacon'): print(my_dog.name + " is a search dog.") print("Please try again.")
"""Make a single-topping pizza.""" my_dog.sit() else:
print("Have a " + topping + " pizza!") my_dog.search() print("Your tickets are printing.")

make_pizza()
make_pizza('pepperoni')
If you had infinite programming skills, what would you Simple is better than complex
Returning a value build?
If you have a choice between a simple and a complex
def add_numbers(x, y): As you're learning to program, it's helpful to think solution, and both work, use the simple solution. Your
"""Add two numbers and return the sum.""" about the real-world projects you'd like to create. It's
return x + y code will be easier to maintain, and it will be easier
a good habit to keep an "ideas" notebook that you for you and others to build on that code later on.
can refer to whenever you want to start a new project.
sum = add_numbers(3, 5)
print(sum)
If you haven't done so already, take a few minutes
and describe three projects you'd like to create.
You can add elements to the end of a list, or you can insert The sort() method changes the order of a list permanently.
them wherever you like in a list. The sorted() function returns a copy of the list, leaving the
original list unchanged. You can sort the items in a list in
Adding an element to the end of the list alphabetical order, or reverse alphabetical order. You can
users.append('amy') also reverse the original order of the list. Keep in mind that
lowercase and uppercase letters may affect the sort order.
Starting with an empty list
Sorting a list permanently
users = []
A list stores a series of items in a particular order. users.append('val') users.sort()
Lists allow you to store sets of information in one users.append('bob') Sorting a list permanently in reverse alphabetical
place, whether you have just a few items or millions users.append('mia')
order
of items. Lists are one of Python's most powerful Inserting elements at a particular position
features readily accessible to new programmers, and users.sort(reverse=True)
they tie together many important concepts in users.insert(0, 'joe')
Sorting a list temporarily
programming. users.insert(3, 'bea')
print(sorted(users))
print(sorted(users, reverse=True))

You can remove elements by their position in a list, or by Reversing the order of a list
Use square brackets to define a list, and use commas to
separate individual items in the list. Use plural names for the value of the item. If you remove an item by its value, users.reverse()
lists, to make your code easier to read. Python removes only the first item that has that value.

Making a list Deleting an element by its position

users = ['val', 'bob', 'mia', 'ron', 'ned'] del users[-1] Lists can contain millions of items, so Python provides an
efficient way to loop through all the items in a list. When
Removing an item by its value you set up a loop, Python pulls each item from the list one
users.remove('mia') at a time and stores it in a temporary variable, which you
Individual elements in a list are accessed according to their provide a name for. This name should be the singular
position, called the index. The index of the first element is version of the list name.
0, the index of the second element is 1, and so forth. The indented block of code makes up the body of the
Negative indices refer to items at the end of the list. To get If you want to work with an element that you're removing loop, where you can work with each individual item. Any
a particular element, write the name of the list and then the from the list, you can "pop" the element. If you think of the lines that are not indented run after the loop is completed.
index of the element in square brackets. list as a stack of items, pop() takes an item off the top of the
stack. By default pop() returns the last element in the list,
Printing all items in a list
Getting the first element but you can also pop elements from any position in the list. for user in users:
first_user = users[0] print(user)
Pop the last item from a list
Getting the second element most_recent_user = users.pop() Printing a message for each item, and a separate
print(most_recent_user) message afterwards
second_user = users[1]
for user in users:
Getting the last element Pop the first item in a list
print("Welcome, " + user + "!")
newest_user = users[-1] first_user = users.pop(0)
print(first_user) print("Welcome, we're glad to see you all!")

Once you've defined a list, you can change individual


elements in the list. You do this by referring to the index of The len() function returns the number of items in a list.
the item you want to modify.
Find the length of a list
Changing an element num_users = len(users)
users[0] = 'valerie' print("We have " + str(num_users) + " users.")
users[-2] = 'ronald'
You can use the range() function to work with a set of To copy a list make a slice that starts at the first item and A tuple is like a list, except you can't change the values in a
numbers efficiently. The range() function starts at 0 by ends at the last item. If you try to copy a list without using tuple once it's defined. Tuples are good for storing
default, and stops one number below the number passed to this approach, whatever you do to the copied list will affect information that shouldn't be changed throughout the life of
it. You can use the list() function to efficiently generate a the original list as well. a program. Tuples are designated by parentheses instead
large list of numbers. of square brackets. (You can overwrite an entire tuple, but
Making a copy of a list you can't change the individual elements in a tuple.)
Printing the numbers 0 to 1000
finishers = ['kai', 'abe', 'ada', 'gus', 'zoe'] Defining a tuple
for number in range(1001): copy_of_finishers = finishers[:]
print(number) dimensions = (800, 600)

Printing the numbers 1 to 1000 Looping through a tuple


for number in range(1, 1001): You can use a loop to generate a list based on a range of for dimension in dimensions:
print(number) numbers or on another list. This is a common operation, so print(dimension)
Python offers a more efficient way to do it. List
Making a list of numbers from 1 to a million comprehensions may look complicated at first; if so, use the Overwriting a tuple
for loop approach until you're ready to start using dimensions = (800, 600)
numbers = list(range(1, 1000001)) comprehensions. print(dimensions)
To write a comprehension, define an expression for the
values you want to store in the list. Then write a for loop to
dimensions = (1200, 900)
generate input values needed to make the list.
There are a number of simple statistics you can run on a list
containing numerical data. Using a loop to generate a list of square numbers
Finding the minimum value in a list squares = [] When you're first learning about data structures such as
ages = [93, 99, 66, 17, 85, 1, 35, 82, 2, 77] for x in range(1, 11): lists, it helps to visualize how Python is working with the
youngest = min(ages) square = x**2 information in your program. pythontutor.com is a great tool
squares.append(square) for seeing how Python keeps track of the information in a
Finding the maximum value list. Try running the following code on pythontutor.com, and
Using a comprehension to generate a list of square
then run your own code.
ages = [93, 99, 66, 17, 85, 1, 35, 82, 2, 77] numbers
oldest = max(ages) Build a list and print the items in the list
squares = [x**2 for x in range(1, 11)]
Finding the sum of all values dogs = []
Using a loop to convert a list of names to upper case dogs.append('willie')
ages = [93, 99, 66, 17, 85, 1, 35, 82, 2, 77]
names = ['kai', 'abe', 'ada', 'gus', 'zoe'] dogs.append('hootz')
total_years = sum(ages)
dogs.append('peso')
upper_names = [] dogs.append('goblin')
for name in names:
You can work with any set of elements from a list. A portion upper_names.append(name.upper()) for dog in dogs:
of a list is called a slice. To slice a list start with the index of print("Hello " + dog + "!")
Using a comprehension to convert a list of names to print("I love these dogs!")
the first item you want, then add a colon and the index after
the last item you want. Leave off the first index to start at upper case
the beginning of the list, and leave off the last index to slice print("\nThese were my first two dogs:")
names = ['kai', 'abe', 'ada', 'gus', 'zoe']
through the end of the list. old_dogs = dogs[:2]
for old_dog in old_dogs:
Getting the first three items upper_names = [name.upper() for name in names]
print(old_dog)
finishers = ['kai', 'abe', 'ada', 'gus', 'zoe']
first_three = finishers[:3] del dogs[0]
Readability counts dogs.remove('peso')
Getting the middle three items print(dogs)
 Use four spaces per indentation level.
middle_three = finishers[1:4]
 Keep your lines to 79 characters or fewer.
Getting the last three items  Use single blank lines to group parts of your
last_three = finishers[-3:] program visually.
You can store as many key-value pairs as you want in a You can loop through a dictionary in three ways: you can
dictionary, until your computer runs out of memory. To add loop through all the key-value pairs, all the keys, or all the
a new key-value pair to an existing dictionary give the name values.
of the dictionary and the new key in square brackets, and A dictionary only tracks the connections between keys
set it equal to the new value. and values; it doesn't track the order of items in the
This also allows you to start with an empty dictionary and dictionary. If you want to process the information in order,
add key-value pairs as they become relevant. you can sort the keys in your loop.

Adding a key-value pair Looping through all key-value pairs


alien_0 = {'color': 'green', 'points': 5} # Store people's favorite languages.
fav_languages = {
alien_0['x'] = 0 'jen': 'python',
Python's dictionaries allow you to connect pieces of
alien_0['y'] = 25 'sarah': 'c',
related information. Each piece of information in a 'edward': 'ruby',
alien_0['speed'] = 1.5
dictionary is stored as a key-value pair. When you 'phil': 'python',
provide a key, Python returns the value associated Adding to an empty dictionary }
with that key. You can loop through all the key-value alien_0 = {}
pairs, all the keys, or all the values. alien_0['color'] = 'green' # Show each person's favorite language.
alien_0['points'] = 5 for name, language in fav_languages.items():
print(name + ": " + language)

Use curly braces to define a dictionary. Use colons to Looping through all the keys
connect keys and values, and use commas to separate You can modify the value associated with any key in a # Show everyone who's taken the survey.
individual key-value pairs. dictionary. To do so give the name of the dictionary and for name in fav_languages.keys():
enclose the key in square brackets, then provide the new print(name)
Making a dictionary
value for that key.
alien_0 = {'color': 'green', 'points': 5} Looping through all the values
Modifying values in a dictionary
# Show all the languages that have been chosen.
alien_0 = {'color': 'green', 'points': 5} for language in fav_languages.values():
print(alien_0) print(language)
To access the value associated with an individual key give
the name of the dictionary and then place the key in a set of # Change the alien's color and point value. Looping through all the keys in order
square brackets. If the key you're asking for is not in the alien_0['color'] = 'yellow'
dictionary, an error will occur. # Show each person's favorite language,
alien_0['points'] = 10 # in order by the person's name.
You can also use the get() method, which returns None print(alien_0)
instead of an error if the key doesn't exist. You can also for name in sorted(fav_languages.keys()):
specify a default value to use if the key is not in the print(name + ": " + language)
dictionary.
Getting the value associated with a key You can remove any key-value pair you want from a
dictionary. To do so use the del keyword and the dictionary
You can find the number of key-value pairs in a dictionary.
alien_0 = {'color': 'green', 'points': 5} name, followed by the key in square brackets. This will
delete the key and its associated value. Finding a dictionary's length
print(alien_0['color'])
print(alien_0['points']) Deleting a key-value pair num_responses = len(fav_languages)
alien_0 = {'color': 'green', 'points': 5}
Getting the value with get()
print(alien_0)
alien_0 = {'color': 'green'}
del alien_0['points']
alien_color = alien_0.get('color') print(alien_0)
alien_points = alien_0.get('points', 0)

print(alien_color)
print(alien_points) Try running some of these examples on pythontutor.com.
It's sometimes useful to store a set of dictionaries in a list; Storing a list inside a dictionary allows you to associate Standard Python dictionaries don't keep track of the order
this is called nesting. more than one value with each key. in which keys and values are added; they only preserve the
association between each key and its value. If you want to
Storing dictionaries in a list Storing lists in a dictionary preserve the order in which keys and values are added, use
# Start with an empty list. # Store multiple languages for each person. an OrderedDict.
users = [] fav_languages = { Preserving the order of keys and values
'jen': ['python', 'ruby'],
# Make a new user, and add them to the list. 'sarah': ['c'], from collections import OrderedDict
new_user = { 'edward': ['ruby', 'go'],
'last': 'fermi', 'phil': ['python', 'haskell'], # Store each person's languages, keeping
'first': 'enrico', } # track of who respoded first.
'username': 'efermi', fav_languages = OrderedDict()
} # Show all responses for each person.
users.append(new_user) for name, langs in fav_languages.items(): fav_languages['jen'] = ['python', 'ruby']
print(name + ": ") fav_languages['sarah'] = ['c']
# Make another new user, and add them as well. for lang in langs: fav_languages['edward'] = ['ruby', 'go']
new_user = { print("- " + lang) fav_languages['phil'] = ['python', 'haskell']
'last': 'curie',
'first': 'marie', # Display the results, in the same order they
'username': 'mcurie', # were entered.
} You can store a dictionary inside another dictionary. In this for name, langs in fav_languages.items():
users.append(new_user) case each value associated with a key is itself a dictionary. print(name + ":")
for lang in langs:
Storing dictionaries in a dictionary print("- " + lang)
# Show all information about each user.
for user_dict in users: users = {
for k, v in user_dict.items(): 'aeinstein': {
print(k + ": " + v) 'first': 'albert',
'last': 'einstein', You can use a loop to generate a large number of
print("\n")
'location': 'princeton', dictionaries efficiently, if all the dictionaries start out with
You can also define a list of dictionaries directly, }, similar data.
without using append(): 'mcurie': { A million aliens
# Define a list of users, where each user 'first': 'marie',
'last': 'curie', aliens = []
# is represented by a dictionary.
users = [ 'location': 'paris',
}, # Make a million green aliens, worth 5 points
{ # each. Have them all start in one row.
'last': 'fermi', }
for alien_num in range(1000000):
'first': 'enrico', new_alien = {}
'username': 'efermi', for username, user_dict in users.items():
print("\nUsername: " + username) new_alien['color'] = 'green'
}, new_alien['points'] = 5
{ full_name = user_dict['first'] + " "
full_name += user_dict['last'] new_alien['x'] = 20 * alien_num
'last': 'curie', new_alien['y'] = 0
'first': 'marie', location = user_dict['location']
aliens.append(new_alien)
'username': 'mcurie',
}, print("\tFull name: " + full_name.title())
print("\tLocation: " + location.title()) # Prove the list contains a million aliens.
] num_aliens = len(aliens)
# Show all information about each user. print("Number of aliens created:")
for user_dict in users: Nesting is extremely useful in certain situations. However, print(num_aliens)
for k, v in user_dict.items(): be aware of making your code overly complex. If you're
print(k + ": " + v) nesting items much deeper than what you see here there
print("\n") are probably simpler ways of managing your data, such as
using classes.
Testing numerical values is similar to testing string values. Several kinds of if statements exist. Your choice of which to
use depends on the number of conditions you need to test.
Testing equality and inequality You can have as many elif blocks as you need, and the
>>> age = 18 else block is always optional.
>>> age == 18 Simple if statement
True
>>> age != 18 age = 19
False
if age >= 18:
Comparison operators print("You're old enough to vote!")
>>> age = 19 If-else statements
>>> age < 21
True age = 17
>>> age <= 21
True if age >= 18:
If statements allow you to examine the current state >>> age > 21 print("You're old enough to vote!")
of a program and respond appropriately to that state. False else:
>>> age >= 21 print("You can't vote yet.")
You can write a simple if statement that checks one
False
condition, or you can create a complex series of if The if-elif-else chain
statements that identify the exact conditions you're
age = 12
looking for.
You can check multiple conditions at the same time. The if age < 4:
While loops run as long as certain conditions remain and operator returns True if all the conditions listed are price = 0
true. You can use while loops to let your programs True. The or operator returns True if any condition is True. elif age < 18:
run as long as your users want them to. Using and to check multiple conditions price = 5
else:
>>> age_0 = 22 price = 10
>>> age_1 = 18
A conditional test is an expression that can be evaluated as >>> age_0 >= 21 and age_1 >= 21 print("Your cost is $" + str(price) + ".")
True or False. Python uses the values True and False to False
decide whether the code in an if statement should be >>> age_1 = 23
executed. >>> age_0 >= 21 and age_1 >= 21
True You can easily test whether a certain value is in a list. You
Checking for equality
A single equal sign assigns a value to a variable. A double equal can also test whether a list is empty before trying to loop
Using or to check multiple conditions through the list.
sign (==) checks whether two values are equal.
>>> age_0 = 22 Testing if a value is in a list
>>> car = 'bmw'
>>> age_1 = 18
>>> car == 'bmw' >>> players = ['al', 'bea', 'cyn', 'dale']
>>> age_0 >= 21 or age_1 >= 21
True >>> 'al' in players
True
>>> car = 'audi' True
>>> age_0 = 18
>>> car == 'bmw' >>> 'eric' in players
>>> age_0 >= 21 or age_1 >= 21
False False
False
Ignoring case when making a comparison
>>> car = 'Audi'
>>> car.lower() == 'audi' A boolean value is either True or False. Variables with
True boolean values are often used to keep track of certain
conditions within a program.
Checking for inequality
Simple boolean values
>>> topping = 'mushrooms'
>>> topping != 'anchovies' game_active = True
True can_edit = False
Testing if a value is not in a list Letting the user choose when to quit Using continue in a loop
banned_users = ['ann', 'chad', 'dee'] prompt = "\nTell me something, and I'll " banned_users = ['eve', 'fred', 'gary', 'helen']
user = 'erin' prompt += "repeat it back to you."
prompt += "\nEnter 'quit' to end the program. " prompt = "\nAdd a player to your team."
if user not in banned_users: prompt += "\nEnter 'quit' when you're done. "
print("You can play!") message = ""
while message != 'quit': players = []
Checking if a list is empty message = input(prompt) while True:
players = [] player = input(prompt)
if message != 'quit': if player == 'quit':
if players: print(message) break
for player in players: elif player in banned_users:
Using a flag print(player + " is banned!")
print("Player: " + player.title())
else: prompt = "\nTell me something, and I'll " continue
print("We have no players yet!") prompt += "repeat it back to you." else:
prompt += "\nEnter 'quit' to end the program. " players.append(player)

active = True print("\nYour team:")


You can allow your users to enter input using the input() while active: for player in players:
statement. In Python 3, all input is stored as a string. message = input(prompt) print(player)

Simple input
if message == 'quit':
name = input("What's your name? ") active = False Every while loop needs a way to stop running so it won't
print("Hello, " + name + ".") else: continue to run forever. If there's no way for the condition to
print(message) become False, the loop will never stop running.
Accepting numerical input
Using break to exit a loop An infinite loop
age = input("How old are you? ")
age = int(age) prompt = "\nWhat cities have you visited?" while True:
prompt += "\nEnter 'quit' when you're done. " name = input("\nWho are you? ")
if age >= 18: print("Nice to meet you, " + name + "!")
print("\nYou can vote!") while True:
else: city = input(prompt)
print("\nYou can't vote yet.")
if city == 'quit': The remove() method removes a specific value from a list,
Accepting input in Python 2.7 break
Use raw_input() in Python 2.7. This function interprets all input as a
but it only removes the first instance of the value you
string, just as input() does in Python 3.
else: provide. You can use a while loop to remove all instances
print("I've been to " + city + "!") of a particular value.
name = raw_input("What's your name? ")
print("Hello, " + name + ".") Removing all cats from a list of pets
pets = ['dog', 'cat', 'dog', 'fish', 'cat',
Sublime Text doesn't run programs that prompt the user for
'rabbit', 'cat']
input. You can use Sublime Text to write programs that
prompt for input, but you'll need to run these programs from print(pets)
A while loop repeats a block of code as long as a condition
is True. a terminal.
while 'cat' in pets:
Counting to 5 pets.remove('cat')

current_number = 1 print(pets)
You can use the break statement and the continue
statement with any of Python's loops. For example you can
while current_number <= 5: use break to quit a for loop that's working through a list or a
print(current_number) dictionary. You can use continue to skip over certain items More cheat sheets available at
current_number += 1 when looping through a list or dictionary as well.
The two main kinds of arguments are positional and A function can return a value or a set of values. When a
keyword arguments. When you use positional arguments function returns a value, the calling line must provide a
Python matches the first argument in the function call with variable in which to store the return value. A function stops
the first parameter in the function definition, and so forth. running when it reaches a return statement.
With keyword arguments, you specify which parameter
each argument should be assigned to in the function call. Returning a single value
When you use keyword arguments, the order of the def get_full_name(first, last):
arguments doesn't matter. """Return a neatly formatted full name."""
Using positional arguments full_name = first + ' ' + last
return full_name.title()
def describe_pet(animal, name):
Functions are named blocks of code designed to do """Display information about a pet.""" musician = get_full_name('jimi', 'hendrix')
print("\nI have a " + animal + ".") print(musician)
one specific job. Functions allow you to write code
print("Its name is " + name + ".")
once that can then be run whenever you need to Returning a dictionary
accomplish the same task. Functions can take in the describe_pet('hamster', 'harry') def build_person(first, last):
information they need, and return the information they describe_pet('dog', 'willie') """Return a dictionary of information
generate. Using functions effectively makes your about a person.
programs easier to write, read, test, and fix. Using keyword arguments
"""
def describe_pet(animal, name): person = {'first': first, 'last': last}
"""Display information about a pet.""" return person
The first line of a function is its definition, marked by the print("\nI have a " + animal + ".")
keyword def. The name of the function is followed by a set print("Its name is " + name + ".") musician = build_person('jimi', 'hendrix')
of parentheses and a colon. A docstring, in triple quotes, print(musician)
describes what the function does. The body of a function is describe_pet(animal='hamster', name='harry')
describe_pet(name='willie', animal='dog') Returning a dictionary with optional values
indented one level.
To call a function, give the name of the function followed def build_person(first, last, age=None):
by a set of parentheses. """Return a dictionary of information
about a person.
Making a function You can provide a default value for a parameter. When
"""
function calls omit this argument the default value will be
def greet_user(): used. Parameters with default values must be listed after person = {'first': first, 'last': last}
"""Display a simple greeting.""" parameters without default values in the function's definition if age:
print("Hello!") so positional arguments can still work correctly. person['age'] = age
return person
greet_user() Using a default value
musician = build_person('jimi', 'hendrix', 27)
def describe_pet(name, animal='dog'):
print(musician)
"""Display information about a pet."""
Information that's passed to a function is called an print("\nI have a " + animal + ".")
musician = build_person('janis', 'joplin')
argument; information that's received by a function is called print("Its name is " + name + ".")
print(musician)
a parameter. Arguments are included in parentheses after
the function's name, and parameters are listed in describe_pet('harry', 'hamster')
parentheses in the function's definition. describe_pet('willie')
Try running some of these examples on pythontutor.com.
Passing a single argument Using None to make an argument optional
def greet_user(username): def describe_pet(animal, name=None):
"""Display a simple greeting.""" """Display information about a pet."""
print("Hello, " + username + "!") print("\nI have a " + animal + ".")
if name:
print("Its name is " + name + ".")
greet_user('jesse')
greet_user('diana')
greet_user('brandon') describe_pet('hamster', 'harry')
describe_pet('snake')
You can pass a list as an argument to a function, and the Sometimes you won't know how many arguments a You can store your functions in a separate file called a
function can work with the values in the list. Any changes function will need to accept. Python allows you to collect an module, and then import the functions you need into the file
the function makes to the list will affect the original list. You arbitrary number of arguments into one parameter using the containing your main program. This allows for cleaner
can prevent a function from modifying a list by passing a * operator. A parameter that accepts an arbitrary number of program files. (Make sure your module is stored in the
copy of the list as an argument. arguments must come last in the function definition. same directory as your main program.)
The ** operator allows a parameter to collect an arbitrary
Passing a list as an argument number of keyword arguments. Storing a function in a module
File: pizza.py
def greet_users(names): Collecting an arbitrary number of arguments
"""Print a simple greeting to everyone.""" def make_pizza(size, *toppings):
for name in names: def make_pizza(size, *toppings): """Make a pizza."""
msg = "Hello, " + name + "!" """Make a pizza.""" print("\nMaking a " + size + " pizza.")
print(msg) print("\nMaking a " + size + " pizza.") print("Toppings:")
print("Toppings:") for topping in toppings:
usernames = ['hannah', 'ty', 'margot'] for topping in toppings: print("- " + topping)
greet_users(usernames) print("- " + topping)
Importing an entire module
Allowing a function to modify a list File: making_pizzas.py
# Make three pizzas with different toppings. Every function in the module is available in the program file.
The following example sends a list of models to a function for
make_pizza('small', 'pepperoni')
printing. The original list is emptied, and the second list is filled. import pizza
make_pizza('large', 'bacon bits', 'pineapple')
def print_models(unprinted, printed): make_pizza('medium', 'mushrooms', 'peppers',
"""3d print a set of models.""" 'onions', 'extra cheese') pizza.make_pizza('medium', 'pepperoni')
while unprinted: pizza.make_pizza('small', 'bacon', 'pineapple')
current_model = unprinted.pop() Collecting an arbitrary number of keyword arguments
Importing a specific function
print("Printing " + current_model) def build_profile(first, last, **user_info): Only the imported functions are available in the program file.
printed.append(current_model) """Build a user's profile dictionary."""
# Build a dict with the required keys. from pizza import make_pizza
# Store some unprinted designs, profile = {'first': first, 'last': last}
# and print each of them. make_pizza('medium', 'pepperoni')
unprinted = ['phone case', 'pendant', 'ring'] # Add any other keys and values. make_pizza('small', 'bacon', 'pineapple')
printed = [] for key, value in user_info.items():
print_models(unprinted, printed) Giving a module an alias
profile[key] = value
import pizza as p
print("\nUnprinted:", unprinted) return profile
print("Printed:", printed) p.make_pizza('medium', 'pepperoni')
# Create two users with different kinds p.make_pizza('small', 'bacon', 'pineapple')
Preventing a function from modifying a list
The following example is the same as the previous one, except the # of information.
user_0 = build_profile('albert', 'einstein',
Giving a function an alias
original list is unchanged after calling print_models().
location='princeton') from pizza import make_pizza as mp
def print_models(unprinted, printed): user_1 = build_profile('marie', 'curie',
"""3d print a set of models.""" location='paris', field='chemistry') mp('medium', 'pepperoni')
while unprinted: mp('small', 'bacon', 'pineapple')
current_model = unprinted.pop() print(user_0)
print("Printing " + current_model) print(user_1) Importing all functions from a module
printed.append(current_model) Don't do this, but recognize it when you see it in others' code. It
can result in naming conflicts, which can cause errors.
# Store some unprinted designs, from pizza import *
# and print each of them. As you can see there are many ways to write and call a
original = ['phone case', 'pendant', 'ring'] function. When you're starting out, aim for something that make_pizza('medium', 'pepperoni')
printed = [] simply works. As you gain experience you'll develop an make_pizza('small', 'bacon', 'pineapple')
understanding of the more subtle advantages of different
print_models(original[:], printed) structures such as positional and keyword arguments, and
print("\nOriginal:", original) the various approaches to importing functions. For now if
print("Printed:", printed) your functions do what you need them to, you're doing well.
If the class you're writing is a specialized version of another
Creating an object from a class class, you can use inheritance. When one class inherits
my_car = Car('audi', 'a4', 2016) from another, it automatically takes on all the attributes and
methods of the parent class. The child class is free to
Accessing attribute values introduce new attributes and methods, and override
attributes and methods of the parent class.
print(my_car.make)
To inherit from another class include the name of the
print(my_car.model)
parent class in parentheses when defining the new class.
print(my_car.year)
Classes are the foundation of object-oriented The __init__() method for a child class
Calling methods
programming. Classes represent real-world things
class ElectricCar(Car):
you want to model in your programs: for example my_car.fill_tank()
"""A simple model of an electric car."""
dogs, cars, and robots. You use a class to make my_car.drive()
objects, which are specific instances of dogs, cars, Creating multiple objects def __init__(self, make, model, year):
and robots. A class defines the general behavior that """Initialize an electric car."""
a whole category of objects can have, and the my_car = Car('audi', 'a4', 2016) super().__init__(make, model, year)
information that can be associated with those objects. my_old_car = Car('subaru', 'outback', 2013)
my_truck = Car('toyota', 'tacoma', 2010)
Classes can inherit from each other – you can # Attributes specific to electric cars.
write a class that extends the functionality of an # Battery capacity in kWh.
existing class. This allows you to code efficiently for a self.battery_size = 70
# Charge level in %.
wide variety of situations. You can modify an attribute's value directly, or you can
self.charge_level = 0
write methods that manage updating values more carefully.
Modifying an attribute directly Adding new methods to the child class
Consider how we might model a car. What information class ElectricCar(Car):
my_new_car = Car('audi', 'a4', 2016)
would we associate with a car, and what behavior would it --snip--
my_new_car.fuel_level = 5
have? The information is stored in variables called def charge(self):
attributes, and the behavior is represented by functions. Writing a method to update an attribute's value """Fully charge the vehicle."""
Functions that are part of a class are called methods. self.charge_level = 100
def update_fuel_level(self, new_level): print("The vehicle is fully charged.")
The Car class """Update the fuel level."""
class Car():
if new_level <= self.fuel_capacity: Using child methods and parent methods
self.fuel_level = new_level
"""A simple attempt to model a car.""" my_ecar = ElectricCar('tesla', 'model s', 2016)
else:
print("The tank can't hold that much!")
def __init__(self, make, model, year): my_ecar.charge()
"""Initialize car attributes.""" Writing a method to increment an attribute's value my_ecar.drive()
self.make = make
self.model = model def add_fuel(self, amount):
self.year = year """Add fuel to the tank."""
if (self.fuel_level + amount
# Fuel capacity and level in gallons. <= self.fuel_capacity): There are many ways to model real world objects and
self.fuel_capacity = 15 self.fuel_level += amount situations in code, and sometimes that variety can feel
self.fuel_level = 0 print("Added fuel.") overwhelming. Pick an approach and try it – if your first
else: attempt doesn't work, try a different approach.
def fill_tank(self): print("The tank won't hold that much.")
"""Fill gas tank to capacity."""
self.fuel_level = self.fuel_capacity
print("Fuel tank is full.")
In Python class names are written in CamelCase and object
def drive(self): names are written in lowercase with underscores. Modules
"""Simulate driving.""" that contain classes should still be named in lowercase with
print("The car is moving.") underscores.
Class files can get long as you add detailed information and
Overriding parent methods functionality. To help keep your program files uncluttered, Classes should inherit from object
class ElectricCar(Car): you can store your classes in modules and import the class ClassName(object):
--snip-- classes you need into your main program.
def fill_tank(self): The Car class in Python 2.7
Storing classes in a file
"""Display an error message.""" car.py class Car(object):
print("This car has no fuel tank!")
"""Represent gas and electric cars.""" Child class __init__() method is different
class ChildClassName(ParentClass):
class Car():
def __init__(self):
A class can have objects as attributes. This allows classes """A simple attempt to model a car."""
super(ClassName, self).__init__()
to work together to model complex situations. --snip--
The ElectricCar class in Python 2.7
A Battery class class Battery():
"""A battery for an electric car.""" class ElectricCar(Car):
class Battery(): def __init__(self, make, model, year):
"""A battery for an electric car.""" --snip--
super(ElectricCar, self).__init__(
class ElectricCar(Car): make, model, year)
def __init__(self, size=70):
"""Initialize battery attributes.""" """A simple model of an electric car."""
# Capacity in kWh, charge level in %. --snip--
self.size = size Importing individual classes from a module A list can hold as many items as you want, so you can
self.charge_level = 0 my_cars.py make a large number of objects from a class and store
them in a list.
def get_range(self): from car import Car, ElectricCar Here's an example showing how to make a fleet of rental
"""Return the battery's range.""" cars, and make sure all the cars are ready to drive.
if self.size == 70: my_beetle = Car('volkswagen', 'beetle', 2016)
my_beetle.fill_tank() A fleet of rental cars
return 240
elif self.size == 85: my_beetle.drive() from car import Car, ElectricCar
return 270
my_tesla = ElectricCar('tesla', 'model s', # Make lists to hold a fleet of cars.
Using an instance as an attribute 2016) gas_fleet = []
my_tesla.charge() electric_fleet = []
class ElectricCar(Car):
my_tesla.drive()
--snip--
Importing an entire module # Make 500 gas cars and 250 electric cars.
def __init__(self, make, model, year): for _ in range(500):
"""Initialize an electric car.""" import car car = Car('ford', 'focus', 2016)
super().__init__(make, model, year) gas_fleet.append(car)
my_beetle = car.Car( for _ in range(250):
# Attribute specific to electric cars. 'volkswagen', 'beetle', 2016) ecar = ElectricCar('nissan', 'leaf', 2016)
self.battery = Battery() my_beetle.fill_tank() electric_fleet.append(ecar)
my_beetle.drive()
def charge(self): # Fill the gas cars, and charge electric cars.
"""Fully charge the vehicle.""" my_tesla = car.ElectricCar( for car in gas_fleet:
self.battery.charge_level = 100 'tesla', 'model s', 2016) car.fill_tank()
print("The vehicle is fully charged.") my_tesla.charge() for ecar in electric_fleet:
my_tesla.drive() ecar.charge()
Using the instance
Importing all classes from a module
my_ecar = ElectricCar('tesla', 'model x', 2016) (Don’t do this, but recognize it when you see it.) print("Gas cars:", len(gas_fleet))
print("Electric cars:", len(electric_fleet))
from car import *
my_ecar.charge()
print(my_ecar.battery.get_range())
my_beetle = Car('volkswagen', 'beetle', 2016)
my_ecar.drive()
Storing the lines in a list Opening a file using an absolute path
filename = 'siddhartha.txt' f_path = "/home/ehmatthes/books/alice.txt"

with open(filename) as f_obj: with open(f_path) as f_obj:


lines = f_obj.readlines() lines = f_obj.readlines()

for line in lines: Opening a file on Windows


Windows will sometimes interpret forward slashes incorrectly. If
print(line.rstrip())
you run into this, use backslashes in your file paths (or a raw string, so Python doesn't treat the backslashes as escape characters).
f_path = r"C:\Users\ehmatthes\books\alice.txt"
Your programs can read information in from files, and Passing the 'w' argument to open() tells Python you want to
they can write data to files. Reading from files allows with open(f_path) as f_obj:
write to the file. Be careful; this will erase the contents of lines = f_obj.readlines()
you to work with a wide variety of information; writing the file if it already exists. Passing the 'a' argument tells
to files allows users to pick up where they left off the Python you want to append to the end of an existing file.
next time they run your program. You can write text to
files, and you can store Python structures such as Writing to an empty file When you think an error may occur, you can write a try-
lists in data files. filename = 'programming.txt' except block to handle the exception that might be raised.
The try block tells Python to try running some code, and the
with open(filename, 'w') as f: except block tells Python what to do if the code results in a
Exceptions are special objects that help your
f.write("I love programming!") particular kind of error.
programs respond to errors in appropriate ways. For
example if your program tries to open a file that Writing multiple lines to an empty file Handling the ZeroDivisionError exception
doesn’t exist, you can use exceptions to display an try:
informative error message instead of having the filename = 'programming.txt'
print(5/0)
program crash. except ZeroDivisionError:
with open(filename, 'w') as f:
f.write("I love programming!\n") print("You can't divide by zero!")
f.write("I love creating new games.\n") Handling the FileNotFoundError exception
To read from a file your program needs to open the file and
Appending to a file f_name = 'siddhartha.txt'
then read the contents of the file. You can read the entire
contents of the file at once, or read the file line by line. The filename = 'programming.txt'
with statement makes sure the file is closed properly when try:
the program has finished accessing the file. with open(filename, 'a') as f: with open(f_name) as f_obj:
f.write("I also love working with data.\n") lines = f_obj.readlines()
Reading an entire file at once f.write("I love making apps as well.\n") except FileNotFoundError:
filename = 'siddhartha.txt' msg = "Can't find file {0}.".format(f_name)
print(msg)
with open(filename) as f_obj:
contents = f_obj.read() When Python runs the open() function, it looks for the file in
the same directory where the program that's being executed
is stored. You can open a file from a subfolder using a It can be hard to know what kind of exception to handle
print(contents) when writing code. Try writing your code without a try block,
relative path. You can also use an absolute path to open
Reading line by line any file on your system. and make it generate an error. The traceback will tell you
Each line that's read from the file has a newline character at the what kind of exception your program needs to handle.
end of the line, and the print function adds its own newline Opening a file from a subfolder
character. The rstrip() method gets rid of the extra blank lines
this would result in when printing to the terminal.
f_path = "text_files/alice.txt"

filename = 'siddhartha.txt' with open(f_path) as f_obj:


lines = f_obj.readlines()
with open(filename) as f_obj:
for line in f_obj: for line in lines:
print(line.rstrip()) print(line.rstrip())
The try block should only contain code that may cause an Sometimes you want your program to just continue running The json module allows you to dump simple Python data
error. Any code that depends on the try block running when it encounters an error, without reporting the error to structures into a file, and load the data from that file the
successfully should be placed in the else block. the user. Using the pass statement in an else block allows next time the program runs. The JSON data format is not
you to do this. specific to Python, so you can share this kind of data with
Using an else block people who work in other languages as well.
Using the pass statement in an else block
print("Enter two numbers. I'll divide them.")
f_names = ['alice.txt', 'siddhartha.txt', Knowing how to manage exceptions is important when
x = input("First number: ") 'moby_dick.txt', 'little_women.txt'] working with stored data. You'll usually want to make sure
y = input("Second number: ") the data you're trying to load exists before working with it.
for f_name in f_names: Using json.dump() to store data
try: # Report the length of each file found.
result = int(x) / int(y) try: """Store some numbers."""
except ZeroDivisionError: with open(f_name) as f_obj:
print("You can't divide by zero!") lines = f_obj.readlines() import json
else: except FileNotFoundError:
print(result) # Just move on to the next file. numbers = [2, 3, 5, 7, 11, 13]
pass
Preventing crashes from user input else: filename = 'numbers.json'
Without the except block in the following example, the program with open(filename, 'w') as f_obj:
num_lines = len(lines)
would crash if the user tries to divide by zero. As written, it will
handle the error gracefully and keep running. msg = "{0} has {1} lines.".format( json.dump(numbers, f_obj)
f_name, num_lines)
"""A simple calculator for division only.""" print(msg)
Using json.load() to read data
"""Load some previously stored numbers."""
print("Enter two numbers. I'll divide them.")
print("Enter 'q' to quit.") import json
Exception-handling code should catch specific exceptions
while True: that you expect to happen during your program's execution. filename = 'numbers.json'
x = input("\nFirst number: ") A bare except block will catch all exceptions, including with open(filename) as f_obj:
if x == 'q': keyboard interrupts and system exits you might need when numbers = json.load(f_obj)
break forcing a program to close.
y = input("Second number: ") print(numbers)
if y == 'q': If you want to use a try block and you're not sure which
break exception to catch, use Exception. It will catch most Making sure the stored data exists
exceptions, but still allow you to interrupt programs
intentionally. import json
try:
result = int(x) / int(y) Don’t use bare except blocks f_name = 'numbers.json'
except ZeroDivisionError:
print("You can't divide by zero!") try:
try:
else: # Do something
with open(f_name) as f_obj:
print(result) except:
numbers = json.load(f_obj)
pass
except FileNotFoundError:
Use Exception instead msg = "Can't find {0}.".format(f_name)
Well-written, properly tested code is not very prone to print(msg)
try: else:
internal errors such as syntax or logical errors. But every # Do something
time your program depends on something external such as print(numbers)
except Exception:
user input or the existence of a file, there's a possibility of pass
an exception being raised. Practice with exceptions
Printing the exception Take a program you've already written that prompts for user
It's up to you how to communicate errors to your users. input, and add some error-handling code to the program.
Sometimes users need to know if a file is missing; try:
sometimes it's better to handle the error silently. A little # Do something
except Exception as e:
experience will help you know how much to report.
print(e, type(e))
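As a sketch for the practice suggestion above, here is a small try-except block around user input; the prompt and variable names are made up. int() raises ValueError for non-numeric input, and printing the exception shows what went wrong.

prompt = "How many tickets do you need? "

try:
    num_tickets = int(input(prompt))
except ValueError as e:
    print("Please enter a number.")
    print(e, type(e))
else:
    print("You asked for", num_tickets, "tickets.")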
Building a testcase with one unit test Running the test
To build a test case, make a class that inherits from When you change your code, it’s important to run your existing
unittest.TestCase and write methods that begin with test_. tests. This will tell you whether the changes you made affected
Save this as test_full_names.py existing behavior.

import unittest E
from full_names import get_full_name ================================================
ERROR: test_first_last (__main__.NamesTestCase)
class NamesTestCase(unittest.TestCase): Test names like Janis Joplin.
"""Tests for names.py.""" ------------------------------------------------
Traceback (most recent call last):
def test_first_last(self): File "test_full_names.py", line 10,
When you write a function or a class, you can also """Test names like Janis Joplin.""" in test_first_last
write tests for that code. Testing proves that your full_name = get_full_name('janis', 'joplin')
code works as it's supposed to in the situations it's 'joplin') TypeError: get_full_name() missing 1 required
designed to handle, and also when people use your self.assertEqual(full_name, positional argument: 'last'
programs in unexpected ways. Writing tests gives 'Janis Joplin')
you confidence that your code will work correctly as ------------------------------------------------
more people begin to use your programs. You can unittest.main() Ran 1 test in 0.001s
also add new features to your programs and know Running the test
that you haven't broken existing behavior. FAILED (errors=1)
Python reports on each unit test in the test case. The dot reports a
single passing test. Python informs us that it ran 1 test in less than Fixing the code
A unit test verifies that one specific aspect of your 0.001 seconds, and the OK lets us know that all unit tests in the When a test fails, the code needs to be modified until the test
test case passed. passes again. (Don’t make the mistake of rewriting your tests to fit
code works as it's supposed to. A test case is a
your new code.) Here we can make the middle name optional.
collection of unit tests which verify your code's .
behavior in a wide variety of situations. --------------------------------------- def get_full_name(first, last, middle=''):
Ran 1 test in 0.000s """Return a full name."""
if middle:
OK full_name = "{0} {1} {2}".format(first,
Python's unittest module provides tools for testing your middle, last)
code. To try it out, we’ll create a function that returns a full else:
name. We’ll use the function in a regular program, and then full_name = "{0} {1}".format(first,
build a test case for the function. Failing tests are important; they tell you that a change in the
last)
code has affected existing behavior. When a test fails, you
A function to test return full_name.title()
need to modify the code so the existing behavior still works.
Save this as full_names.py
Modifying the function Running the test
def get_full_name(first, last): Now the test should pass again, which means our original
We’ll modify get_full_name() so it handles middle names, but
"""Return a full name.""" functionality is still intact.
we’ll do it in a way that breaks existing behavior.
full_name = "{0} {1}".format(first, last) .
return full_name.title() def get_full_name(first, middle, last):
---------------------------------------
"""Return a full name."""
Ran 1 test in 0.000s
Using the function full_name = "{0} {1} {2}".format(first,
Save this as names.py middle, last)
OK
return full_name.title()
from full_names import get_full_name
Using the function
janis = get_full_name('janis', 'joplin')
print(janis) from full_names import get_full_name

bob = get_full_name('bob', 'dylan') john = get_full_name('john', 'lee', 'hooker')


print(john)
print(bob)

david = get_full_name('david', 'lee', 'roth')


print(david)
You can add as many unit tests to a test case as you need. Testing a class is similar to testing a function, since you’ll When testing a class, you usually have to make an instance
To write a new test, add a new method to your test case mostly be testing your methods. of the class. The setUp() method is run before every test.
class. Any instances you make in setUp() are available in every
A class to test test you write.
Testing middle names Save as accountant.py
We’ve shown that get_full_name() works for first and last Using setUp() to support multiple tests
names. Let’s test that it works for middle names as well.
class Accountant():
The instance self.acc can be used in each new test.
"""Manage a bank account."""
import unittest import unittest
from full_names import get_full_name def __init__(self, balance=0): from accountant import Accountant
self.balance = balance
class NamesTestCase(unittest.TestCase): class TestAccountant(unittest.TestCase):
"""Tests for names.py.""" def deposit(self, amount): """Tests for the class Accountant."""
self.balance += amount
def test_first_last(self): def setUp(self):
"""Test names like Janis Joplin.""" def withdraw(self, amount): self.acc = Accountant()
full_name = get_full_name('janis', self.balance -= amount
'joplin') def test_initial_balance(self):
self.assertEqual(full_name, Building a testcase # Default balance should be 0.
'Janis Joplin') For the first test, we’ll make sure we can start out with different
initial balances. Save this as test_accountant.py. self.assertEqual(self.acc.balance, 0)

def test_middle(self): import unittest # Test non-default balance.


"""Test names like David Lee Roth.""" from accountant import Accountant acc = Accountant(100)
full_name = get_full_name('david', self.assertEqual(acc.balance, 100)
'roth', 'lee') class TestAccountant(unittest.TestCase):
self.assertEqual(full_name, """Tests for the class Accountant.""" def test_deposit(self):
'David Lee Roth') # Test single deposit.
def test_initial_balance(self): self.acc.deposit(100)
unittest.main() # Default balance should be 0. self.assertEqual(self.acc.balance, 100)
acc = Accountant()
Running the tests self.assertEqual(acc.balance, 0)
The two dots represent two passing tests. # Test multiple deposits.
self.acc.deposit(100)
.. # Test non-default balance. self.acc.deposit(100)
--------------------------------------- acc = Accountant(100) self.assertEqual(self.acc.balance, 300)
Ran 2 tests in 0.000s self.assertEqual(acc.balance, 100)
def test_withdrawal(self):
OK unittest.main() # Test single withdrawal.
self.acc.deposit(1000)
Running the test
self.acc.withdraw(100)
. self.assertEqual(self.acc.balance, 900)
Python provides a number of assert methods you can use
to test your code. ---------------------------------------
Ran 1 test in 0.000s unittest.main()
Verify that a==b, or a != b
OK Running the tests
assertEqual(a, b)
assertNotEqual(a, b) ...
---------------------------------------
Verify that x is True, or x is False In general you shouldn’t modify a test once it’s written. Ran 3 tests in 0.001s
assertTrue(x) When a test fails it usually means new code you’ve written
has broken existing functionality, and you need to modify OK
assertFalse(x)
the new code until all existing tests pass.
Verify an item is in a list, or not in a list If your original requirements have changed, it may be
appropriate to modify some tests. This usually happens in
assertIn(item, list) the early stages of a project when desired behavior is still
assertNotIn(item, list) being sorted out.
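A small sketch exercising several of these assert methods in one test case; the ages list and its values are made up for illustration.

import unittest

class AssertMethodsTestCase(unittest.TestCase):
    """Demonstrate a few of unittest's assert methods."""

    def test_assorted_asserts(self):
        ages = [18, 21, 35]
        self.assertEqual(len(ages), 3)
        self.assertNotEqual(ages[0], ages[1])
        self.assertTrue(21 in ages)
        self.assertFalse(50 in ages)
        self.assertIn(35, ages)
        self.assertNotIn(50, ages)

unittest.main()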
The following code sets up an empty game window, and
starts an event loop and a loop that continually refreshes Useful rect attributes
Once you have a rect object, there are a number of attributes that
the screen.
are useful when positioning objects and detecting relative positions
An empty game window of objects. (You can find more attributes in the Pygame
documentation.)
import sys
# Individual x and y values:
import pygame as pg
screen_rect.left, screen_rect.right
screen_rect.top, screen_rect.bottom
Pygame is a framework for making games using def run_game():
screen_rect.centerx, screen_rect.centery
Python. Making games is fun, and it’s a great way to # Initialize and set up screen.
screen_rect.width, screen_rect.height
expand your programming skills and knowledge. pg.init()
Pygame takes care of many of the lower-level tasks screen = pg.display.set_mode((1200, 800))
# Tuples
in building games, which lets you focus on the pg.display.set_caption("Alien Invasion")
screen_rect.center
aspects of your game that make it interesting. screen_rect.size
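A short sketch of how these rect attributes position one object relative to another; the rect sizes here are arbitrary.

import pygame as pg

pg.init()
screen = pg.display.set_mode((1200, 800))
screen_rect = screen.get_rect()

# Place a ship rect at the bottom center of the screen.
ship_rect = pg.Rect(0, 0, 60, 48)
ship_rect.midbottom = screen_rect.midbottom

# Line a bullet rect up with the ship, starting at the ship's top edge.
bullet_rect = pg.Rect(0, 0, 3, 15)
bullet_rect.centerx = ship_rect.centerx
bullet_rect.bottom = ship_rect.top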
# Start main loop.
while True: Creating a rect object
# Start event loop. You can create a rect object from scratch. For example a small rect
Pygame runs on all systems, but setup is slightly different for event in pg.event.get(): object that’s filled in can represent a bullet in a game. The Rect()
if event.type == pg.QUIT: class takes the coordinates of the upper left corner, and the width
on each OS. The instructions here assume you’re using and height of the rect. The draw.rect() function takes a screen
Python 3, and provide a minimal installation of Pygame. If sys.exit()
object, a color, and a rect. This function fills the given rect with the
these instructions don’t work for your system, see the more given color.
detailed notes at http://ehmatthes.github.io/pcc/. # Refresh screen.
pg.display.flip() bullet_rect = pg.Rect(100, 100, 3, 15)
Pygame on Linux color = (100, 100, 100)
$ sudo apt-get install python3-dev mercurial run_game() pg.draw.rect(screen, color, bullet_rect)
libsdl-image1.2-dev libsdl2-dev Setting a custom window size
libsdl-ttf2.0-dev The display.set_mode() function accepts a tuple that defines the
$ pip install --user screen size. Many objects in a game are images that are moved around
hg+http://bitbucket.org/pygame/pygame the screen. It’s easiest to use bitmap (.bmp) image files, but
screen_dim = (1200, 800)
screen = pg.display.set_mode(screen_dim) you can also configure your system to work with jpg, png,
Pygame on OS X and gif files as well.
This assumes you’ve used Homebrew to install Python 3.
Setting a custom background color
$ brew install hg sdl sdl_image sdl_ttf Colors are defined as a tuple of red, green, and blue values. Each Loading an image
$ pip install --user value ranges from 0-255. ship = pg.image.load('images/ship.bmp')
hg+http://bitbucket.org/pygame/pygame bg_color = (230, 230, 230)
Getting the rect object from an image
Pygame on Windows screen.fill(bg_color)
Find an installer at ship_rect = ship.get_rect()
https://bitbucket.org/pygame/pygame/downloads/ or
http://www.lfd.uci.edu/~gohlke/pythonlibs/#pygame that matches Positioning an image
your version of Python. Run the installer file if it’s a .exe or .msi file. Many objects in a game can be treated as simple With rects, it’s easy to position an image wherever you want on the
If it’s a .whl file, use pip to install Pygame: rectangles, rather than their actual shape. This simplifies screen, or in relation to another object. The following code
positions a ship object at the bottom center of the screen.
> python -m pip install --user code without noticeably affecting game play. Pygame has a
pygame-1.9.2a0-cp35-none-win32.whl rect object that makes it easy to work with game objects. ship_rect.midbottom = screen_rect.midbottom

Testing your installation Getting the screen rect object


We already have a screen object; we can easily access the rect
To test your installation, open a terminal session and try to import
object associated with the screen.
Pygame. If you don’t get any error messages, your installation was
successful. screen_rect = screen.get_rect()
$ python
Finding the center of the screen
>>> import pygame Rect objects have a center attribute which stores the center point.
>>>
screen_center = screen_rect.center
Pygame’s event loop registers an event any time the
Drawing an image to the screen mouse moves, or a mouse button is pressed or released. Removing an item from a group
Once an image is loaded and positioned, you can draw it to the It’s important to delete elements that will never appear again in the
screen with the blit() method. The blit() method acts on the screen Responding to the mouse button game, so you don’t waste memory and resources.
object, and takes the image object and image rect as arguments.
for event in pg.event.get(): bullets.remove(bullet)
# Draw ship to screen. if event.type == pg.MOUSEBUTTONDOWN:
screen.blit(ship, ship_rect) ship.fire_bullet()
The blitme() method Finding the mouse position You can detect when a single object collides with any
Game objects such as ships are often written as classes. Then a The mouse position is returned as a tuple. member of a group. You can also detect when any member
blitme() method is usually defined, which draws the object to the of one group collides with a member of another group.
screen. mouse_pos = pg.mouse.get_pos()
def blitme(self):
Collisions between a single object and a group
Clicking a button The spritecollideany() function takes an object and a group, and
"""Draw ship at current location.""" You might want to know if the cursor is over an object such as a returns True if the object overlaps with any member of the group.
self.screen.blit(self.image, self.rect) button. The rect.collidepoint() method returns true when a point is
inside a rect object. if pg.sprite.spritecollideany(ship, aliens):
ships_left -= 1
if button_rect.collidepoint(mouse_pos):
Pygame watches for events such as key presses and start_game() Collisions between two groups
mouse actions. You can detect any event you care about in The sprite.groupcollide() function takes two groups, and two
Hiding the mouse booleans. The function returns a dictionary containing information
the event loop, and respond with any action that’s
about the members that have collided. The booleans tell Pygame
appropriate for your game. pg.mouse.set_visible(False) whether to delete the members of either group that have collided.
Responding to key presses collisions = pg.sprite.groupcollide(
Pygame’s main event loop registers a KEYDOWN event any time a bullets, aliens, True, True)
key is pressed. When this happens, you can check for specific Pygame has a Group class which makes working with a
keys.
group of similar objects easier. A group is like a list, with score += len(collisions) * alien_point_value
for event in pg.event.get(): some extra functionality that’s helpful when building games.
if event.type == pg.KEYDOWN:
Making and filling a group
if event.key == pg.K_RIGHT: An object that will be placed in a group must inherit from Sprite.
ship_rect.x += 1 You can use text for a variety of purposes in a game. For
elif event.key == pg.K_LEFT: from pygame.sprite import Sprite, Group example you can share information with players, and you
ship_rect.x -= 1 can display a score.
elif event.key == pg.K_SPACE: class Bullet(Sprite):
ship.fire_bullet() ... The following code defines a message, then a color for the text and
elif event.key == pg.K_q: def draw_bullet(self): the background color for the message. A font is defined using the
sys.exit() ... default system font, with a font size of 48. The font.render()
def update(self): function is used to create an image of the message, and we get the
Responding to released keys ... rect object associated with the image. We then center the image
When the user releases a key, a KEYUP event is triggered. on the screen and display it.

if event.type == pg.KEYUP: bullets = Group() msg = "Play again?"


if event.key == pg.K_RIGHT: msg_color = (100, 100, 100)
ship.moving_right = False new_bullet = Bullet() bg_color = (230, 230, 230)
bullets.add(new_bullet)
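Because the Bullet class above is only sketched with ellipses, here is an illustrative, self-contained version of a minimal Sprite subclass that can live in a group; the surface size, color, position, and speed are made-up values.

import pygame as pg
from pygame.sprite import Sprite, Group

class Bullet(Sprite):
    """A minimal sprite: a small surface that moves up the screen."""

    def __init__(self, x, y):
        super().__init__()
        self.image = pg.Surface((3, 15))
        self.image.fill((60, 60, 60))
        self.rect = self.image.get_rect()
        self.rect.midbottom = (x, y)

    def update(self):
        # Called for every member when the group's update() runs.
        self.rect.y -= 1

bullets = Group()
bullets.add(Bullet(600, 750))
bullets.update()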
f = pg.font.SysFont(None, 48)
Looping through the items in a group msg_image = f.render(msg, True, msg_color,
The Pygame documentation is really helpful when building The sprites() method returns all the members of a group.
bg_color)
your own games. The home page for the Pygame project is for bullet in bullets.sprites(): msg_image_rect = msg_image.get_rect()
at http://pygame.org/, and the home page for the bullet.draw_bullet() msg_image_rect.center = screen_rect.center
documentation is at http://pygame.org/docs/. screen.blit(msg_image, msg_image_rect)
The most useful part of the documentation are the pages Calling update() on a group
about specific parts of Pygame, such as the Rect() class Calling update() on a group automatically calls update() on each
member of the group.
and the sprite module. You can find a list of these elements
at the top of the help pages. bullets.update()
Making a scatter plot Emphasizing points
The scatter() function takes a list of x values and a list of y values, You can plot as much data as you want on one plot. Here we re-
and a variety of optional arguments. The s=10 argument controls plot the first and last points larger to emphasize them.
the size of each point.
import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
x_values = list(range(1000))
x_values = list(range(1000)) squares = [x**2 for x in x_values]
squares = [x**2 for x in x_values] plt.scatter(x_values, squares, c=squares,
cmap=plt.cm.Blues, edgecolor='none',
plt.scatter(x_values, squares, s=10) s=10)
plt.show()
Data visualization involves exploring data through plt.scatter(x_values[0], squares[0], c='green',
visual representations. The matplotlib package helps edgecolor='none', s=100)
you make visually appealing representations of the plt.scatter(x_values[-1], squares[-1], c='red',
data you’re working with. matplotlib is extremely Plots can be customized in a wide variety of ways. Just
edgecolor='none', s=100)
flexible; these examples will help you get started with about any element of a plot can be customized.
a few simple visualizations. Adding titles and labels, and scaling axes plt.title("Square Numbers", fontsize=24)
--snip--
import matplotlib.pyplot as plt
Removing axes
matplotlib runs on all systems, but setup is slightly different x_values = list(range(1000)) You can customize or remove axes entirely. Here’s how to access
depending on your OS. If the minimal instructions here squares = [x**2 for x in x_values] each axis, and hide it.
don’t work for you, see the more detailed instructions at plt.scatter(x_values, squares, s=10) plt.axes().get_xaxis().set_visible(False)
http://ehmatthes.github.io/pcc/. You should also consider plt.axes().get_yaxis().set_visible(False)
installing the Anaconda distribution of Python from plt.xlabel("Value", fontsize=18)
https://continuum.io/downloads/, which includes matplotlib. plt.xlabel("Value", fontsize=18) Setting a custom figure size
plt.ylabel("Square of Value", fontsize=18) You can make your plot as big or small as you want. Before
matplotlib on Linux plt.tick_params(axis='both', which='major', plotting your data, add the following code. The dpi argument is
optional; if you don’t know your system’s resolution you can omit
$ sudo apt-get install python3-matplotlib labelsize=14)
the argument and adjust the figsize argument accordingly.
plt.axis([0, 1100, 0, 1100000])
matplotlib on OS X plt.figure(dpi=128, figsize=(10, 6))
Start a terminal session and enter import matplotlib to see if plt.show()
it’s already installed on your system. If not, try this command: Saving a plot
$ pip install --user matplotlib Using a colormap The matplotlib viewer has an interactive save button, but you can
A colormap varies the point colors from one shade to another, also save your visualizations programmatically. To do so, replace
based on a certain value for each point. The value used to plt.show() with plt.savefig(). The bbox_inches='tight'
matplotlib on Windows argument trims extra whitespace from the plot.
You first need to install Visual Studio, which you can do from determine the color of each point is passed to the c argument, and
https://dev.windows.com/. The Community edition is free. Then go the cmap argument specifies which colormap to use.
plt.savefig('squares.png', bbox_inches='tight')
to https://pypi.python.org/pypi/matplotlib/ or The edgecolor='none' argument removes the black outline
http://www.lfd.uic.edu/~gohlke/pythonlibs/#matplotlib and download from each point.
an appropriate installer file.
plt.scatter(x_values, squares, c=squares,
cmap=plt.cm.Blues, edgecolor='none', The matplotlib gallery and documentation are at
s=10) http://matplotlib.org/. Be sure to visit the examples, gallery,
and pyplot links.
Making a line graph
import matplotlib.pyplot as plt

x_values = [0, 1, 2, 3, 4, 5]
squares = [0, 1, 4, 9, 16, 25]
plt.plot(x_values, squares)
plt.show()
You can make as many plots as you want on one figure. You can include as many individual graphs in one figure as
When you make multiple plots, you can emphasize Datetime formatting arguments you want. This is useful, for example, when comparing
The strftime() function generates a formatted string from a
relationships in the data. For example you can fill the space related datasets.
datetime object, and the strptime() function generates a
between two sets of data. datetime object from a string. The following codes let you work with Sharing an x-axis
Plotting two sets of data dates exactly as you need to. The following code plots a set of squares and a set of cubes on
Here we use plt.scatter() twice to plot square numbers and %A Weekday name, such as Monday two separate graphs that share a common x-axis.
cubes on the same figure. The plt.subplots() function returns a figure object and a tuple
%B Month name, such as January of axes. Each set of axes corresponds to a separate plot in the
import matplotlib.pyplot as plt %m Month, as a number (01 to 12) figure. The first two arguments control the number of rows and
%d Day of the month, as a number (01 to 31) columns generated in the figure.
x_values = list(range(11)) %Y Four-digit year, such as 2016
%y Two-digit year, such as 16 import matplotlib.pyplot as plt
squares = [x**2 for x in x_values]
cubes = [x**3 for x in x_values] %H Hour, in 24-hour format (00 to 23)
%I Hour, in 12-hour format (01 to 12) x_vals = list(range(11))
%p AM or PM squares = [x**2 for x in x_vals]
plt.scatter(x_values, squares, c='blue',
%M Minutes (00 to 59) cubes = [x**3 for x in x_vals]
edgecolor='none', s=20)
plt.scatter(x_values, cubes, c='red', %S Seconds (00 to 61)
fig, axarr = plt.subplots(2, 1, sharex=True)
edgecolor='none', s=20)
Converting a string to a datetime object
axarr[0].scatter(x_vals, squares)
plt.axis([0, 11, 0, 1100]) new_years = dt.strptime('1/1/2017', '%m/%d/%Y')
axarr[0].set_title('Squares')
plt.show()
Converting a datetime object to a string
Filling the space between data sets axarr[1].scatter(x_vals, cubes, c='red')
The fill_between() method fills the space between two data ny_string = dt.strftime(new_years, '%B %d, %Y') axarr[1].set_title('Cubes')
sets. It takes a series of x-values and two series of y-values. It also print(ny_string)
takes a facecolor to use for the fill, and an optional alpha plt.show()
argument that controls the color’s transparency. Plotting high temperatures
The following code creates a list of dates and a corresponding list
Sharing a y-axis
plt.fill_between(x_values, cubes, squares, of high temperatures. It then plots the high temperatures, with the
To share a y-axis, we use the sharey=True argument.
facecolor='blue', alpha=0.25) date labels displayed in a specific format.
from datetime import datetime as dt import matplotlib.pyplot as plt

import matplotlib.pyplot as plt x_vals = list(range(11))


Many interesting data sets have a date or time as the x- squares = [x**2 for x in x_vals]
value. Python’s datetime module helps you work with this from matplotlib import dates as mdates
cubes = [x**3 for x in x_vals]
kind of data.
dates = [
fig, axarr = plt.subplots(1, 2, sharey=True)
Generating the current date dt(2016, 6, 21), dt(2016, 6, 22),
The datetime.now() function returns a datetime object dt(2016, 6, 23), dt(2016, 6, 24),
representing the current date and time. ] axarr[0].scatter(x_vals, squares)
axarr[0].set_title('Squares')
from datetime import datetime as dt
highs = [57, 68, 64, 59]
axarr[1].scatter(x_vals, cubes, c='red')
today = dt.now() axarr[1].set_title('Cubes')
date_string = dt.strftime(today, '%m/%d/%Y') fig = plt.figure(dpi=128, figsize=(10,6))
print(date_string) plt.plot(dates, highs, c='red')
plt.title("Daily High Temps", fontsize=24) plt.show()
Generating a specific date plt.ylabel("Temp (F)", fontsize=16)
You can also generate a datetime object for any date and time you
want. The positional order of arguments is year, month, and day. x_axis = plt.axes().get_xaxis()
The hour, minute, second, and microsecond arguments are
x_axis.set_major_formatter(
optional.
mdates.DateFormatter('%B %d %Y')
from datetime import datetime as dt )
fig.autofmt_xdate()
new_years = dt(2017, 1, 1)
fall_equinox = dt(year=2016, month=9, day=22) plt.show()
You can add as much data as you want when making a
Making a scatter plot visualization.
The data for a scatter plot needs to be a list containing tuples of
the form (x, y). The stroke=False argument tells Pygal to make Plotting squares and cubes
an XY chart with no line connecting the points.
import pygal
import pygal
x_values = list(range(11))
squares = [ squares = [x**2 for x in x_values]
(0, 0), (1, 1), (2, 4), (3, 9), cubes = [x**3 for x in x_values]
Data visualization involves exploring data through (4, 16), (5, 25),
visual representations. Pygal helps you make visually ] chart = pygal.Line()
appealing representations of the data you’re working
chart.force_uri_protocol = 'http'
with. Pygal is particularly well suited for visualizations chart = pygal.XY(stroke=False) chart.title = "Squares and Cubes"
that will be presented online, because it supports chart.force_uri_protocol = 'http' chart.x_labels = x_values
interactive elements. chart.add('x^2', squares)
chart.render_to_file('squares.svg') chart.add('Squares', squares)
Using a list comprehension for a scatter plot chart.add('Cubes', cubes)
Pygal can be installed using pip. A list comprehension can be used to efficiently make a dataset for
a scatter plot.
Pygal on Linux and OS X Filling the area under a data series
squares = [(x, x**2) for x in range(1000)] Pygal allows you to fill the area under or over each series of data.
$ pip install --user pygal The default is to fill from the x-axis up, but you can fill from any
Making a bar graph horizontal line using the zero argument.
Pygal on Windows A bar graph requires a list of values for the bar sizes. To label the
bars, pass a list of the same length to x_labels. chart = pygal.Line(fill=True, zero=0)
> python -m pip install --user pygal
import pygal

outcomes = [1, 2, 3, 4, 5, 6]
To make a plot with Pygal, you specify the kind of plot and frequencies = [18, 16, 18, 17, 18, 13]
then add the data.
chart = pygal.Bar()
Making a line graph
To view the output, open the file squares.svg in a browser.
chart.force_uri_protocol = 'http'
chart.x_labels = outcomes
import pygal chart.add('D6', frequencies)
chart.render_to_file('rolling_dice.svg')
x_values = [0, 1, 2, 3, 4, 5]
squares = [0, 1, 4, 9, 16, 25] Making a bar graph from a dictionary
Since each bar needs a label and a value, a dictionary is a great The documentation for Pygal is available at
way to store the data for a bar graph. The keys are used as the http://www.pygal.org/.
chart = pygal.Line() labels along the x-axis, and the values are used to determine the
chart.force_uri_protocol = 'http' height of each bar.
chart.add('x^2', squares)
chart.render_to_file('squares.svg') import pygal If you’re viewing svg output in a browser, Pygal needs to
render the output file in a specific way. The
Adding labels and a title results = { force_uri_protocol attribute for chart objects needs to
--snip-- 1:18, 2:16, 3:18, be set to 'http'.
chart = pygal.Line() 4:17, 5:18, 6:13,
chart.force_uri_protocol = 'http' }
chart.title = "Squares"
chart.x_labels = x_values chart = pygal.Bar()
chart.x_title = "Value" chart.force_uri_protocol = 'http' Covers Python 3 and Python 2
chart.y_title = "Square of Value" chart.x_labels = results.keys()
chart.add('x^2', squares) chart.add('D6', results.values())
chart.render_to_file('squares.svg') chart.render_to_file('rolling_dice.svg')
Pygal lets you customize many elements of a plot. There Pygal can generate world maps, and you can add any data
are some excellent default themes, and many options for Configuration settings you want to these maps. Data is indicated by coloring, by
Some settings are controlled by a Config object.
styling individual plot elements. labels, and by tooltips that show data when users hover
my_config = pygal.Config() over each country on the map.
Using built-in styles
my_config.show_y_guides = False
To use built-in styles, import the style and make an instance of the Installing the world map module
style class. Then pass the style object with the style argument my_config.width = 1000 The world map module is not included by default in Pygal 2.0. It
when you make the chart object. my_config.dots_size = 5 can be installed with pip:
import pygal chart = pygal.Line(config=my_config) $ pip install --user pygal_maps_world
from pygal.style import LightGreenStyle --snip-- Making a world map
x_values = list(range(11)) Styling series The following code makes a simple world map showing the
You can give each series on a chart different style settings. countries of North America.
squares = [x**2 for x in x_values]
cubes = [x**3 for x in x_values] chart.add('Squares', squares, dots_size=2) from pygal.maps.world import World
chart.add('Cubes', cubes, dots_size=3)
chart_style = LightGreenStyle() wm = World()
chart = pygal.Line(style=chart_style) Styling individual data points wm.force_uri_protocol = 'http'
chart.force_uri_protocol = 'http' You can style individual data points as well. To do so, write a wm.title = 'North America'
chart.title = "Squares and Cubes" dictionary for each data point you want to customize. A 'value' wm.add('North America', ['ca', 'mx', 'us'])
chart.x_labels = x_values key is required, and other properties are optional.
import pygal wm.render_to_file('north_america.svg')
chart.add('Squares', squares)
chart.add('Cubes', cubes) Showing all the country codes
repos = [ In order to make maps, you need to know Pygal’s country codes.
chart.render_to_file('squares_cubes.svg') { The following example will print an alphabetical list of each country
'value': 20506, and its code.
Parametric built-in styles
Some built-in styles accept a custom color, then generate a theme 'color': '#3333CC',
'xlink': 'http://djangoproject.com/', from pygal.maps.world import COUNTRIES
based on that color.
},
from pygal.style import LightenStyle 20054, for code in sorted(COUNTRIES.keys()):
12607, print(code, COUNTRIES[code])
--snip-- 11827,
chart_style = LightenStyle('#336688')
Plotting numerical data on a world map
] To plot numerical data on a map, pass a dictionary to add()
chart = pygal.Line(style=chart_style) instead of a list.
--snip-- chart = pygal.Bar()
chart.force_uri_protocol = 'http' from pygal.maps.world import World
Customizing individual style properties
Style objects have a number of properties you can set individually. chart.x_labels = [
populations = {
'django', 'requests', 'scikit-learn',
chart_style = LightenStyle('#336688') 'ca': 34126000,
'tornado',
chart_style.plot_background = '#CCCCCC' ] 'us': 309349000,
chart_style.major_label_font_size = 20 chart.y_title = 'Stars' 'mx': 113423000,
chart_style.label_font_size = 16 chart.add('Python Repos', repos) }
--snip-- chart.render_to_file('python_repos.svg')
wm = World()
Custom style class wm.force_uri_protocol = 'http'
You can start with a bare style class, and then set only the wm.title = 'Population of North America'
properties you care about. wm.add('North America', populations)
chart_style = Style()
chart_style.colors = [ wm.render_to_file('na_populations.svg')
'#CCCCCC', '#AAAAAA', '#888888']
chart_style.plot_background = '#EEEEEE'

chart = pygal.Line(style=chart_style)


--snip--
The data in a Django project is structured as a set of Users interact with a project through web pages, and a
models. project’s home page can start out as a simple page with no
data. A page usually needs a URL, a view, and a template.
Defining a model
To define the models for your app, modify the file models.py that Mapping a project’s URLs
was created in your app’s folder. The __str__() method tells The project’s main urls.py file tells Django where to find the urls.py
Django how to represent data objects based on this model. files associated with each app in the project.
from django.db import models from django.conf.urls import include, url
from django.contrib import admin
Django is a web framework which helps you build class Topic(models.Model):
interactive websites using Python. With Django you """A topic the user is learning about.""" urlpatterns = [
define the kind of data your site needs to work with, text = models.CharField(max_length=200) url(r'^admin/', include(admin.site.urls)),
and you define the ways your users can work with date_added = models.DateTimeField( url(r'', include('learning_logs.urls',
that data. auto_now_add=True) namespace='learning_logs')),
]
def __str__(self):
return self.text Mapping an app’s URLs
It’s usualy best to install Django to a virtual environment, An app’s urls.py file tells Django which view to use for each URL in
where your project can be isolated from your other Python Activating a model the app. You’ll need to make this file yourself, and save it in the
projects. Most commands assume you’re working in an To use a model the app must be added to the tuple app’s folder.
active virtual environment. INSTALLED_APPS, which is stored in the project’s settings.py file.
from django.conf.urls import url
Create a virtual environment INSTALLED_APPS = (
--snip-- from . import views
$ python -m venv ll_env 'django.contrib.staticfiles',
Activate the environment (Linux and OS X) urlpatterns = [
# My apps url(r'^$', views.index, name='index'),
$ source ll_env/bin/activate 'learning_logs', ]
)
Activate the environment (Windows) Writing a simple view
Migrating the database A view takes information from a request and sends data to the
> ll_env\Scripts\activate browser, often through a template. View functions are stored in an
The database needs to be modified to store the kind of data that
the model represents. app’s views.py file. This simple view function doesn’t pull in any
Install Django to the active environment data, but it uses the template index.html to render the home page.
(ll_env)$ pip install Django $ python manage.py makemigrations learning_logs
$ python manage.py migrate from django.shortcuts import render

Creating a superuser def index(request):


A superuser is a user account that has access to all aspects of the """The home page for Learning Log."""
To start a project we’ll create a new project, create a project.
database, and start a development server. return render(request,
$ python manage.py createsuperuser 'learning_logs/index.html')
Create a new project
Registering a model
$ django-admin.py startproject learning_log . You can register your models with Django’s admin site, which
makes it easier to work with the data in your project. To do this,
Create a database modify the app’s admin.py file. View the admin site at The documentation for Django is available at
$ python manage.py migrate http://localhost:8000/admin/. http://docs.djangoproject.com/. The Django documentation
is thorough and user-friendly, so check it out!
View the project from django.contrib import admin
After issuing this command, you can view the project at
http://localhost:8000/. from learning_logs.models import Topic

$ python manage.py runserver admin.site.register(Topic)


Create a new app
A Django project is made up of one or more apps.
$ python manage.py startapp learning_logs
A new model can use an existing model. The ForeignKey
Writing a simple template attribute establishes a connection between instances of the Using data in a template
A template sets up the structure for a page. It’s a mix of html and The data in the view function’s context dictionary is available
two related models. Make sure to migrate the database
template code, which is like Python but not as powerful. Make a within the template. This data is accessed using template
folder called templates inside the project folder. Inside the after adding a new model to your app. variables, which are indicated by doubled curly braces.
templates folder make another folder with the same name as the The vertical line after a template variable indicates a filter. In this
app. This is where the template files should be saved.
Defining a model with a foreign key
case a filter called date formats date objects, and the filter
class Entry(models.Model): linebreaks renders paragraphs properly on a web page.
<p>Learning Log</p>
"""Learning log entries for a topic.""" {% extends 'learning_logs/base.html' %}
topic = models.ForeignKey(Topic)
<p>Learning Log helps you keep track of your
text = models.TextField() {% block content %}
learning, for any topic you're learning
date_added = models.DateTimeField(
about.</p>
auto_now_add=True) <p>Topic: {{ topic }}</p>
def __str__(self): <p>Entries:</p>
Many elements of a web page are repeated on every page return self.text[:50] + "..." <ul>
in the site, or every page in a section of the site. By writing {% for entry in entries %}
one parent template for the site, and one for each section, <li>
you can easily modify the look and feel of your entire site. <p>
Most pages in a project need to present data that’s specific
The parent template to the current user. {{ entry.date_added|date:'M d, Y H:i' }}
The parent template defines the elements common to a set of </p>
pages, and defines blocks that will be filled by individual pages. URL parameters <p>
A URL often needs to accept a parameter telling it which data to {{ entry.text|linebreaks }}
<p> access from the database. The second URL pattern shown here </p>
<a href="{% url 'learning_logs:index' %}"> looks for the ID of a specific topic and stores it in the parameter </li>
Learning Log topic_id.
{% empty %}
</a> urlpatterns = [ <li>There are no entries yet.</li>
</p> url(r'^$', views.index, name='index'), {% endfor %}
url(r'^topics/(?P<topic_id>\d+)/$', </ul>
{% block content %}{% endblock content %} views.topic, name='topic'),
The child template ] {% endblock content %}
The child template uses the {% extends %} template tag to pull in
the structure of the parent template. It then defines the content for
Using data in a view
The view uses a parameter from the URL to pull the correct data
any blocks defined in the parent template.
from the database. In this example the view is sending a context
{% extends 'learning_logs/base.html' %} dictionary to the template, containing data that should be displayed You can explore the data in your project from the command
on the page. line. This is helpful for developing queries and testing code
{% block content %} def topic(request, topic_id): snippets.
<p> """Show a topic and all its entries.""" Start a shell session
Learning Log helps you keep track topic = Topic.objects.get(id=topic_id)
of your learning, for any topic you're entries = topic.entry_set.order_by( $ python manage.py shell
learning about. '-date_added')
</p> Access data from the project
context = {
{% endblock content %} 'topic': topic, >>> from learning_logs.models import Topic
'entries': entries, >>> Topic.objects.all()
} [<Topic: Chess>, <Topic: Rock Climbing>]
return render(request, >>> topic = Topic.objects.get(id=1)
'learning_logs/topic.html', context) >>> topic.text
'Chess'

Python code is usually indented by four spaces. In


templates you’ll often see two spaces used for indentation, If you make a change to your project and the change
doesn’t seem to have any effect, try restarting the server:
because elements tend to be nested more deeply in
templates. $ python manage.py runserver
Defining the URLs Showing the current login status
Users will need to be able to log in, log out, and register. Make a You can modify the base.html template to show whether the user is
new urls.py file in the users app folder. The login view is a default currently logged in, and to provide a link to the login and logout
view provided by Django. pages. Django makes a user object available to every template,
and this template takes advantage of this object.
from django.conf.urls import url The user.is_authenticated tag allows you to serve specific
from django.contrib.auth.views import login content to users depending on whether they have logged in or not.
The {{ user.username }} property allows you to greet users
from . import views who have logged in. Users who haven’t logged in see links to
register or log in.
urlpatterns = [ <p>
url(r'^login/$', login, <a href="{% url 'learning_logs:index' %}">
Most web applications need to let users create {'template_name': 'users/login.html'}, Learning Log
accounts. This lets users create and work with their name='login'), </a>
own data. Some of this data may be private, and url(r'^logout/$', views.logout_view, {% if user.is_authenticated %}
name='logout'),
some may be public. Django’s forms allow users to Hello, {{ user.username }}.
url(r'^register/$', views.register, <a href="{% url 'users:logout' %}">
enter and modify their data. name='register'), log out
] </a>
The login template {% else %}
User accounts are handled by a dedicated app called <a href="{% url 'users:register' %}">
users. Users need to be able to register, log in, and log The login view is provided by default, but you need to provide your
own login template. The template shown here displays a simple register
out. Django automates much of this work for you. login form, and provides basic error messages. Make a templates </a> -
Making a users app folder in the users folder, and then make a users folder in the <a href="{% url 'users:login' %}">
templates folder. Save this file as login.html. log in
After making the app, be sure to add 'users' to INSTALLED_APPS
The tag {% csrf_token %} helps prevent a common type of </a>
in the project’s settings.py file.
attack with forms. The {{ form.as_p }} element displays the
{% endif %}
$ python manage.py startapp users default login form in paragraph format. The <input> element
named next redirects the user to the home page after a successful </p>
Including URLS for the users app login.
Add a line to the project’s urls.py file so the users app’s URLs are {% block content %}{% endblock content %}
included in the project. {% extends "learning_logs/base.html" %}
The logout view
urlpatterns = [ {% block content %} The logout_view() function uses Django’s logout() function
url(r'^admin/', include(admin.site.urls)), {% if form.errors %} and then redirects the user back to the home page. Since there is
url(r'^users/', include('users.urls', no logout page, there is no logout template. Make sure to write this
<p>
namespace='users')), code in the views.py file that’s stored in the users app folder.
Your username and password didn't match.
url(r'', include('learning_logs.urls', Please try again. from django.http import HttpResponseRedirect
namespace='learning_logs')), </p> from django.core.urlresolvers import reverse
] {% endif %} from django.contrib.auth import logout

<form method="post" def logout_view(request):


action="{% url 'users:login' %}"> """Log the user out."""
{% csrf_token %} logout(request)
There are a number of ways to create forms and work with {{ form.as_p }} return HttpResponseRedirect(
them. You can use Django’s defaults, or completely <button name="submit">log in</button> reverse('learning_logs:index'))
customize your forms. For a simple way to let users enter
data based on your models, use a ModelForm. This creates <input type="hidden" name="next"
a form that allows users to enter data that will populate the value="{% url 'learning_logs:index' %}"/>
fields on a model. </form>
The register view on the back of this sheet shows a simple
approach to form processing. If the view doesn’t receive {% endblock content %}
data from a form, it responds with a blank form. If it
receives POST data from a form, it validates the data and
then saves it to the database.
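The ModelForm itself is not shown on this sheet; a minimal sketch, assuming the Topic model defined earlier and a forms.py file inside the learning_logs app, might look like this.

from django import forms

from .models import Topic

class TopicForm(forms.ModelForm):
    class Meta:
        model = Topic
        fields = ['text']
        labels = {'text': ''}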
The register view The register template Restricting access to logged-in users
The register view needs to display a blank registration form when The register template displays the registration form in paragraph Some pages are only relevant to registered users. The views for
the page is first requested, and then process completed formats. these pages can be protected by the @login_required decorator.
registration forms. A successful registration logs the user in and Any view with this decorator will automatically redirect non-logged
redirects to the home page. {% extends 'learning_logs/base.html' %} in users to an appropriate page. Here’s an example views.py file.
from django.contrib.auth import login {% block content %} from django.contrib.auth.decorators import /
from django.contrib.auth import authenticate login_required
from django.contrib.auth.forms import \ <form method='post' --snip--
UserCreationForm action="{% url 'users:register' %}">
@login_required
def register(request): {% csrf_token %} def topic(request, topic_id):
"""Register a new user.""" {{ form.as_p }} """Show a topic and all its entries."""
if request.method != 'POST':
# Show blank registration form. Setting the redirect URL
<button name='submit'>register</button> The @login_required decorator sends unauthorized users to the
form = UserCreationForm() <input type='hidden' name='next' login page. Add the following line to your project’s settings.py file
else: value="{% url 'learning_logs:index' %}"/> so Django will know how to find your login page.
# Process completed form.
form = UserCreationForm( </form> LOGIN_URL = '/users/login/'
data=request.POST)
Preventing inadvertent access
{% endblock content %} Some pages serve data based on a parameter in the URL. You
Users will have data that belongs to them. Any model that should be connected directly to a user needs a field connecting instances of the model to a specific user.

Making a topic belong to a user
Only the highest-level data in a hierarchy needs to be directly connected to a user. To do this, import the User model and add it as a foreign key on the data model. After modifying the model you'll need to migrate the database, and you'll need to choose a user ID to connect each existing instance to.

from django.db import models
from django.contrib.auth.models import User

class Topic(models.Model):
    """A topic the user is learning about."""
    text = models.CharField(max_length=200)
    date_added = models.DateTimeField(
        auto_now_add=True)
    owner = models.ForeignKey(User)

    def __str__(self):
        return self.text
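The migration itself uses the standard manage.py commands (not shown on the original sheet); when prompted, supply a default user ID for existing rows. Assuming the model lives in the learning_logs app:

$ python manage.py makemigrations learning_logs
$ python manage.py migrate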
Querying data for the current user
In a view, the request object has a user attribute. You can use this attribute to query for the user's data. The filter() function then pulls the data that belongs to the current user.

topics = Topic.objects.filter(
    owner=request.user)

If you provide some initial data, Django generates a form with the user's existing data. Users can then modify and save their data.

Creating a form with initial data
The instance parameter allows you to specify initial data for a form.

form = EntryForm(instance=entry)

Modifying data before saving
The argument commit=False allows you to make changes before writing data to the database.

new_topic = form.save(commit=False)
new_topic.owner = request.user
new_topic.save()

Restricting access to logged-in users
Some pages are only relevant to registered users. The views for these pages can be protected by the @login_required decorator. Any view with this decorator will automatically redirect non-logged-in users to an appropriate page. Here's an example views.py file.

from django.contrib.auth.decorators import \
    login_required
--snip--

@login_required
def topic(request, topic_id):
    """Show a topic and all its entries."""
    --snip--

Setting the redirect URL
The @login_required decorator sends unauthorized users to the login page. Add the following line to your project's settings.py file so Django will know how to find your login page.

LOGIN_URL = '/users/login/'

Preventing inadvertent access
Some pages serve data based on a parameter in the URL. You can check that the current user owns the requested data, and return a 404 error if they don't. Here's an example view.

from django.http import Http404
--snip--
def topic(request, topic_id):
    """Show a topic and all its entries."""
    topic = Topic.objects.get(id=topic_id)
    if topic.owner != request.user:
        raise Http404
    --snip--
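An equivalent sketch uses Django's get_object_or_404() shortcut, which raises a 404 automatically when the object doesn't exist (the shortcut is standard Django, but this variant isn't from the original sheet):

from django.shortcuts import get_object_or_404
--snip--
def topic(request, topic_id):
    """Show a topic and all its entries."""
    # 404 if the topic doesn't exist; then check ownership as above.
    topic = get_object_or_404(Topic, id=topic_id)
    if topic.owner != request.user:
        raise Http404
    --snip--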
The django-bootstrap3 app allows you to use the Bootstrap library to make your project look visually appealing. The app provides tags that you can use in your templates to style individual elements on a page. Learn more at http://django-bootstrap3.readthedocs.io/.

Heroku lets you push your project to a live server, making it available to anyone with an internet connection. Heroku offers a free service level, which lets you learn the deployment process without any commitment. You'll need to install a set of Heroku tools and use git to track the state of your project. See http://devcenter.heroku.com/, and click on the Python link.
SQL CHEAT SHEET http://www.sqltutorial.org

QUERYING DATA FROM A TABLE

SELECT c1, c2 FROM t;
Query data in columns c1, c2 from a table

SELECT * FROM t;
Query all rows and columns from a table

SELECT c1, c2 FROM t
WHERE condition;
Query data and filter rows with a condition

SELECT DISTINCT c1 FROM t
WHERE condition;
Query distinct rows from a table

SELECT c1, c2 FROM t
ORDER BY c1 ASC [DESC];
Sort the result set in ascending or descending order

SELECT c1, c2 FROM t
ORDER BY c1
LIMIT n OFFSET offset;
Skip offset rows and return the next n rows

SELECT c1, aggregate(c2)
FROM t
GROUP BY c1;
Group rows using an aggregate function

SELECT c1, aggregate(c2)
FROM t
GROUP BY c1
HAVING condition;
Filter groups using the HAVING clause

QUERYING FROM MULTIPLE TABLES

SELECT c1, c2
FROM t1
INNER JOIN t2 ON condition;
Inner join t1 and t2

SELECT c1, c2
FROM t1
LEFT JOIN t2 ON condition;
Left join t1 and t2

SELECT c1, c2
FROM t1
RIGHT JOIN t2 ON condition;
Right join t1 and t2

SELECT c1, c2
FROM t1
FULL OUTER JOIN t2 ON condition;
Perform a full outer join

SELECT c1, c2
FROM t1
CROSS JOIN t2;
Produce a Cartesian product of rows from both tables

SELECT c1, c2
FROM t1, t2;
Another way to perform a cross join

SELECT A.c1, B.c2
FROM t1 A
INNER JOIN t1 B ON condition;
Join t1 to itself using an INNER JOIN clause

USING SQL OPERATORS

SELECT c1, c2 FROM t1
UNION [ALL]
SELECT c1, c2 FROM t2;
Combine rows from two queries

SELECT c1, c2 FROM t1
INTERSECT
SELECT c1, c2 FROM t2;
Return the intersection of two queries

SELECT c1, c2 FROM t1
MINUS
SELECT c1, c2 FROM t2;
Subtract one result set from another

SELECT c1, c2 FROM t
WHERE c1 [NOT] LIKE pattern;
Query rows using pattern matching with % and _

SELECT c1, c2 FROM t
WHERE c1 [NOT] IN value_list;
Query rows whose values are in a list

SELECT c1, c2 FROM t
WHERE c1 BETWEEN low AND high;
Query rows with values between two bounds

SELECT c1, c2 FROM t
WHERE c1 IS [NOT] NULL;
Check whether a value is NULL or not
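As a runnable illustration of a few of these constructs, here is a sketch using Python's built-in sqlite3 module; the offices/employees tables and their rows are made up for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offices(id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE employees(id INTEGER PRIMARY KEY, name TEXT, office_id INTEGER);
    INSERT INTO offices VALUES (1, 'NYC'), (2, 'SF');
    INSERT INTO employees VALUES (1, 'Ada', 1), (2, 'Grace', 1), (3, 'Linus', NULL);
""")

# LEFT JOIN keeps offices with no employees; GROUP BY/HAVING then filters them out.
rows = conn.execute("""
    SELECT o.city, COUNT(e.id) AS n
    FROM offices o
    LEFT JOIN employees e ON e.office_id = o.id
    GROUP BY o.city
    HAVING COUNT(e.id) >= 1
    ORDER BY n DESC;
""").fetchall()
print(rows)  # [('NYC', 2)] -- SF has no employees, so HAVING removes it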
MANAGING TABLES

CREATE TABLE t (
  id INT PRIMARY KEY,
  name VARCHAR NOT NULL,
  price INT DEFAULT 0
);
Create a new table with three columns

DROP TABLE t;
Delete the table from the database

ALTER TABLE t ADD column;
Add a new column to the table

ALTER TABLE t DROP COLUMN c;
Drop column c from the table

ALTER TABLE t ADD constraint;
Add a constraint

ALTER TABLE t DROP constraint;
Drop a constraint

ALTER TABLE t1 RENAME TO t2;
Rename a table from t1 to t2

ALTER TABLE t1 RENAME c1 TO c2;
Rename column c1 to c2

TRUNCATE TABLE t;
Remove all data in a table

USING SQL CONSTRAINTS

CREATE TABLE t(
  c1 INT, c2 INT, c3 VARCHAR,
  PRIMARY KEY (c1, c2)
);
Set c1 and c2 as a composite primary key

CREATE TABLE t1(
  c1 INT PRIMARY KEY,
  c2 INT,
  FOREIGN KEY (c2) REFERENCES t2(c2)
);
Set the c2 column as a foreign key

CREATE TABLE t(
  c1 INT, c2 INT, c3 INT,
  UNIQUE (c2, c3)
);
Make the combination of values in c2 and c3 unique

CREATE TABLE t(
  c1 INT, c2 INT,
  CHECK (c1 > 0 AND c1 >= c2)
);
Ensure c1 > 0 and that values in c1 are >= values in c2

CREATE TABLE t(
  c1 INT PRIMARY KEY,
  c2 VARCHAR NOT NULL
);
Disallow NULL values in the c2 column

MODIFYING DATA

INSERT INTO t(column_list)
VALUES (value_list);
Insert one row into a table

INSERT INTO t(column_list)
VALUES (value_list),
       (value_list), ...;
Insert multiple rows into a table

INSERT INTO t1(column_list)
SELECT column_list
FROM t2;
Insert rows from t2 into t1

UPDATE t
SET c1 = new_value;
Set a new value in column c1 for all rows

UPDATE t
SET c1 = new_value,
    c2 = new_value
WHERE condition;
Update values in columns c1 and c2 for rows that match the condition

DELETE FROM t;
Delete all data in a table

DELETE FROM t
WHERE condition;
Delete the subset of rows that match the condition
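When running these statements from application code, pass values as parameters rather than formatting them into the SQL string. A minimal sqlite3 sketch (the table and values are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t(id INTEGER PRIMARY KEY, name TEXT NOT NULL, price INT DEFAULT 0)")

# INSERT INTO t(column_list) VALUES (value_list), with ? placeholders.
conn.execute("INSERT INTO t(name, price) VALUES (?, ?)", ("widget", 10))
conn.executemany("INSERT INTO t(name, price) VALUES (?, ?)",
                 [("bolt", 1), ("nut", 2)])

# UPDATE ... WHERE condition, also parameterized.
conn.execute("UPDATE t SET price = ? WHERE name = ?", (3, "nut"))
print(conn.execute("SELECT name, price FROM t ORDER BY id").fetchall())
# [('widget', 10), ('bolt', 1), ('nut', 3)]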
MANAGING VIEWS

CREATE VIEW v(c1, c2)
AS
SELECT c1, c2
FROM t;
Create a new view that consists of c1 and c2

CREATE VIEW v(c1, c2)
AS
SELECT c1, c2
FROM t
WITH [CASCADED | LOCAL] CHECK OPTION;
Create a new view with a check option

CREATE RECURSIVE VIEW v
AS
select-statement    -- anchor part
UNION [ALL]
select-statement;   -- recursive part
Create a recursive view

CREATE TEMPORARY VIEW v
AS
SELECT c1, c2
FROM t;
Create a temporary view

DROP VIEW view_name;
Delete a view

MANAGING INDEXES

CREATE INDEX idx_name
ON t(c1, c2);
Create an index on columns c1 and c2 of table t

CREATE UNIQUE INDEX idx_name
ON t(c3, c4);
Create a unique index on columns c3 and c4 of table t

DROP INDEX idx_name;
Drop an index

SQL AGGREGATE FUNCTIONS

AVG – returns the average of a list
COUNT – returns the number of elements in a list
SUM – returns the total of a list
MAX – returns the maximum value in a list
MIN – returns the minimum value in a list

MANAGING TRIGGERS

CREATE OR MODIFY TRIGGER trigger_name
WHEN EVENT
ON table_name TRIGGER_TYPE
EXECUTE stored_procedure;
Create or modify a trigger

WHEN
• BEFORE – invoke before the event occurs
• AFTER – invoke after the event occurs

EVENT
• INSERT – invoke for INSERT
• UPDATE – invoke for UPDATE
• DELETE – invoke for DELETE

TRIGGER_TYPE
• FOR EACH ROW
• FOR EACH STATEMENT

CREATE TRIGGER before_insert_person
BEFORE INSERT
ON person FOR EACH ROW
EXECUTE stored_procedure;
Create a trigger invoked before a new row is inserted into the person table

DROP TRIGGER trigger_name;
Delete a specific trigger
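The trigger syntax above is generic, and dialects differ. As one concrete, runnable sketch, SQLite puts the trigger body inline between BEGIN and END instead of calling a stored procedure (the person/audit tables are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person(id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE audit(msg TEXT, at TEXT DEFAULT CURRENT_TIMESTAMP);

    CREATE TRIGGER before_insert_person
    BEFORE INSERT ON person
    FOR EACH ROW
    BEGIN
        INSERT INTO audit(msg) VALUES ('inserting ' || NEW.name);
    END;
""")
conn.execute("INSERT INTO person(name) VALUES ('Ada')")
print(conn.execute("SELECT msg FROM audit").fetchall())  # [('inserting Ada',)]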
CS 229 – Machine Learning https://stanford.edu/~shervine
VIP Cheatsheet: Supervised Learning
Afshine Amidi and Shervine Amidi
September 9, 2018

Introduction to Supervised Learning

Given a set of data points {x^(1), ..., x^(m)} associated to a set of outcomes {y^(1), ..., y^(m)}, we want to build a classifier that learns how to predict y from x.

Type of prediction – The different types of predictive models are summed up in the table below:

             Regression            Classifier
Outcome      Continuous            Class
Examples     Linear regression     Logistic regression, SVM, Naive Bayes

Type of model – The different models are summed up in the table below:

               Discriminative model        Generative model
Goal           Directly estimate P(y|x)    Estimate P(x|y) to deduce P(y|x)
What's learned Decision boundary           Probability distributions of the data
Examples       Regressions, SVMs           GDA, Naive Bayes

Notations and general concepts

Hypothesis – The hypothesis is noted h_θ and is the model that we choose. For a given input data x^(i), the model prediction output is h_θ(x^(i)).

Loss function – A loss function is a function L : (z, y) ∈ R × Y ↦ L(z, y) ∈ R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:

              Least squared        Logistic               Hinge            Cross-entropy
Loss L(z, y)  (1/2)(y − z)²        log(1 + exp(−yz))      max(0, 1 − yz)   −[y log(z) + (1 − y) log(1 − z)]
Used by       Linear regression    Logistic regression    SVM              Neural network

Cost function – The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:

    J(θ) = Σ_{i=1}^{m} L(h_θ(x^(i)), y^(i))

Gradient descent – By noting α ∈ R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:

    θ ← θ − α ∇J(θ)

Remark: stochastic gradient descent (SGD) updates the parameter based on each training example, while batch gradient descent updates it on a batch of training examples.
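As an illustration (not part of the original cheatsheet), here is a minimal NumPy sketch of batch gradient descent on the least-squares cost for linear regression; the data and the learning rate are made up.

import numpy as np

# Made-up data: 100 samples, a bias column plus 2 features.
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 2))]
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=100)

alpha = 0.05                      # learning rate
theta = np.zeros(X.shape[1])      # initial parameters
for _ in range(500):
    grad = X.T @ (X @ theta - y) / len(y)   # gradient of the averaged least-squares cost
    theta -= alpha * grad                   # theta <- theta - alpha * grad J(theta)
print(theta)  # close to true_theta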
Likelihood – The likelihood L(θ) of a model, given parameters θ, is used to find the optimal parameters θ by maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ) = log(L(θ)), which is easier to optimize. We have:

    θ_opt = arg max_θ L(θ)

Newton's algorithm – Newton's algorithm is a numerical method that finds θ such that ℓ'(θ) = 0. Its update rule is as follows:

    θ ← θ − ℓ'(θ) / ℓ''(θ)

Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:

    θ ← θ − (∇²_θ ℓ(θ))⁻¹ ∇_θ ℓ(θ)
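A minimal sketch of the scalar update, applied to a made-up concave log-likelihood ℓ(θ) = 5θ − e^θ whose maximizer is log 5 (the function is chosen only for illustration):

import math

def l_prime(theta):
    return 5 - math.exp(theta)        # l'(theta)

def l_double_prime(theta):
    return -math.exp(theta)           # l''(theta)

theta = 0.0
for _ in range(20):
    theta -= l_prime(theta) / l_double_prime(theta)   # Newton update
print(theta, math.log(5))             # both approximately 1.609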


Linear regression

We assume here that y|x; θ ∼ N(µ, σ²).

Normal equations – By noting X the design matrix, the value of θ that minimizes the cost function has a closed-form solution:

    θ = (XᵀX)⁻¹ Xᵀ y

LMS algorithm – By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, also known as the Widrow-Hoff learning rule, is as follows:

    ∀j,  θ_j ← θ_j + α Σ_{i=1}^{m} ( y^(i) − h_θ(x^(i)) ) x_j^(i)

Remark: the update rule is a particular case of gradient ascent.

LWR – Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w^(i)(x), which is defined with parameter τ ∈ R as:

    w^(i)(x) = exp( −(x^(i) − x)² / (2τ²) )
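A sketch of the normal equations in NumPy, solving the linear system rather than forming the inverse explicitly (the data is made up):

import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]   # design matrix with a bias column
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=50)

# theta = (X^T X)^(-1) X^T y, via a linear solve.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)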

Classification and logistic regression

Sigmoid function – The sigmoid function g, also known as the logistic function, is defined as follows:

    ∀z ∈ R,  g(z) = 1 / (1 + e^(−z)) ∈ ]0, 1[

Logistic regression – We assume here that y|x; θ ∼ Bernoulli(φ). We have the following form:

    φ = p(y = 1|x; θ) = 1 / (1 + exp(−θᵀx)) = g(θᵀx)

Remark: there is no closed-form solution for the case of logistic regression.

Softmax regression – A softmax regression, also called multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θ_K = 0, which makes the Bernoulli parameter φ_i of each class i equal to:

    φ_i = exp(θ_iᵀx) / Σ_{j=1}^{K} exp(θ_jᵀx)

Generalized Linear Models

Exponential family – A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter η (also called the canonical parameter or link function), a sufficient statistic T(y) and a log-partition function a(η) as follows:

    p(y; η) = b(y) exp(η T(y) − a(η))

Remark: we will often have T(y) = y. Also, exp(−a(η)) can be seen as a normalization parameter that makes sure the probabilities sum to one.

The most common exponential-family distributions are summed up in the following table:

Distribution   η                T(y)   a(η)                    b(y)
Bernoulli      log(φ/(1−φ))     y      log(1 + exp(η))         1
Gaussian       µ                y      η²/2                    (1/√(2π)) exp(−y²/2)
Poisson        log(λ)           y      e^η                     1/y!
Geometric      log(1−φ)         y      log(e^η/(1−e^η))        1

Assumptions of GLMs – Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x ∈ R^(n+1) and rely on the following 3 assumptions:

    (1) y|x; θ ∼ ExpFamily(η)    (2) h_θ(x) = E[y|x; θ]    (3) η = θᵀx

Remark: ordinary least squares and logistic regression are special cases of generalized linear models.

Support Vector Machines

The goal of support vector machines is to find the line that maximizes the minimum distance to the line.

Optimal margin classifier – The optimal margin classifier h is such that:

    h(x) = sign(wᵀx − b)

where (w, b) ∈ Rⁿ × R is the solution of the following optimization problem:

    min (1/2)||w||²    such that    y^(i)(wᵀx^(i) − b) ≥ 1

Remark: the line is defined as wᵀx − b = 0.

Hinge loss – The hinge loss is used in the setting of SVMs and is defined as follows:

    L(z, y) = [1 − yz]₊ = max(0, 1 − yz)

Kernel – Given a feature mapping φ, we define the kernel K as:

    K(x, z) = φ(x)ᵀ φ(z)

In practice, the kernel K defined by K(x, z) = exp( −||x − z||² / (2σ²) ) is called the Gaussian kernel and is commonly used.

Remark: we say that we use the "kernel trick" to compute the cost function using the kernel, because we actually don't need to know the explicit mapping φ, which is often very complicated. Instead, only the values K(x, z) are needed.

Lagrangian – We define the Lagrangian L(w, b) as follows:

    L(w, b) = f(w) + Σ_{i=1}^{l} β_i h_i(w)

Remark: the coefficients β_i are called the Lagrange multipliers.
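A one-function sketch of the Gaussian kernel mentioned above (sigma is a bandwidth parameter chosen for the example):

import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

print(gaussian_kernel([0.0, 0.0], [1.0, 1.0], sigma=0.5))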


Generative Learning

A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) with Bayes' rule.

Gaussian Discriminant Analysis

Setting – Gaussian Discriminant Analysis assumes that y, x|y = 0 and x|y = 1 are such that:

    y ∼ Bernoulli(φ)
    x|y = 0 ∼ N(µ₀, Σ)    and    x|y = 1 ∼ N(µ₁, Σ)

Estimation – Maximizing the likelihood gives the following estimates:

    φ̂ = (1/m) Σ_{i=1}^{m} 1{y^(i) = 1}

    µ̂_j = ( Σ_{i=1}^{m} 1{y^(i) = j} x^(i) ) / ( Σ_{i=1}^{m} 1{y^(i) = j} )    (j = 0, 1)

    Σ̂ = (1/m) Σ_{i=1}^{m} (x^(i) − µ_{y^(i)})(x^(i) − µ_{y^(i)})ᵀ

Naive Bayes

Assumption – The Naive Bayes model supposes that the features of each data point are all independent:

    P(x|y) = P(x₁, x₂, ...|y) = P(x₁|y) P(x₂|y) ... = Π_{i=1}^{n} P(x_i|y)

Solutions – Maximizing the log-likelihood gives the following solutions, with k ∈ {0, 1} and l ∈ [[1, L]]:

    P(y = k) = (1/m) × #{j | y^(j) = k}
    P(x_i = l | y = k) = #{j | y^(j) = k and x_i^(j) = l} / #{j | y^(j) = k}

Remark: Naive Bayes is widely used for text classification and spam detection.

Tree-based and ensemble methods

These methods can be used for both regression and classification problems.

CART – Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.

Random forest – A tree-based technique that uses a large number of decision trees built out of randomly selected sets of features. Contrary to a single decision tree, it is highly uninterpretable, but its generally good performance makes it a popular algorithm.

Remark: random forests are a type of ensemble method.

Boosting – The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up below:
• Adaptive boosting (known as Adaboost): high weights are put on errors to improve at the next boosting step.
• Gradient boosting: weak learners are trained on the remaining errors.
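As an illustration of these tree-based methods (scikit-learn and the toy dataset are assumptions, not part of the cheatsheet):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy example on a built-in dataset; all settings are illustrative.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # mean accuracy on held-out data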


Other non-parametric approaches

k-nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.

Remark: the higher the parameter k, the higher the bias; the lower the parameter k, the higher the variance.
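A minimal NumPy sketch of k-NN classification by majority vote (the two clusters are made up):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Predict the label of x as the majority class among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
print(knn_predict(X_train, y_train, np.array([3.5, 3.5]), k=5))   # most likely 1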
Learning Theory

Union bound – Let A₁, ..., A_k be k events. We have:

    P(A₁ ∪ ... ∪ A_k) ≤ P(A₁) + ... + P(A_k)

Hoeffding inequality – Let Z₁, ..., Z_m be m iid variables drawn from a Bernoulli distribution of parameter φ. Let φ̂ be their sample mean and γ > 0 be fixed. We have:

    P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m)

Remark: this inequality is also known as the Chernoff bound.

Training error – For a given classifier h, we define the training error ε̂(h), also known as the empirical risk or empirical error, as follows:

    ε̂(h) = (1/m) Σ_{i=1}^{m} 1{h(x^(i)) ≠ y^(i)}

Probably Approximately Correct (PAC) – PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:
• the training and testing sets follow the same distribution
• the training examples are drawn independently

Shattering – Given a set S = {x^(1), ..., x^(d)} and a set of classifiers H, we say that H shatters S if for any set of labels {y^(1), ..., y^(d)} we have:

    ∃h ∈ H, ∀i ∈ [[1, d]],  h(x^(i)) = y^(i)

Upper bound theorem – Let H be a finite hypothesis class such that |H| = k, and let δ and the sample size m be fixed. Then, with probability at least 1 − δ, we have:

    ε(ĥ) ≤ ( min_{h∈H} ε(h) ) + 2 √( (1/(2m)) log(2k/δ) )

VC dimension – The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H), is the size of the largest set that is shattered by H.

Remark: the VC dimension of H = {set of linear classifiers in 2 dimensions} is 3.

Theorem (Vapnik) – Let H be given, with VC(H) = d, and let m be the number of training examples. With probability at least 1 − δ, we have:

    ε(ĥ) ≤ ( min_{h∈H} ε(h) ) + O( √( (d/m) log(m/d) + (1/m) log(1/δ) ) )