IAM Best Practice: use predefined roles when they exist (over primitive roles). Follow the principle of least privilege.

Compute Options (IaaS, PaaS, SaaS)
Summary: App Engine is the PaaS option: serverless and ops-free. Compute Engine is the IaaS option: fully controllable down to the OS level. Kubernetes Engine is in the middle: clusters of machines running Kubernetes and hosting containers.
IaaS: gives you the infrastructure pieces (VMs), but you have to maintain/join together the different infrastructure pieces for your application to work. Most flexible option.
PaaS: gives you all the infrastructure pieces already joined, so you just have to deploy source code on the platform for your application to work. PaaS solutions are managed services/no-ops (highly available/reliable) and serverless/autoscaling (elastic). Less flexible than IaaS.

Additional Notes
You can also mix and match multiple compute options.
Preemptible Instances: instances that run at a much lower price but may be terminated at any time; they self-terminate after 24 hours. Ideal for interruptible workloads.
Snapshots: used for backups of disks.
Images: VM OS (Ubuntu, CentOS).

Stackdriver
GCP's monitoring, logging, and diagnostics solution. Provides insights into the health, performance, and availability of applications.
Main Functions
• Debugger: inspect the state of an app in real time without stopping/slowing it down, e.g. code behavior
• Error Reporting: counts, analyzes, and aggregates crashes in cloud services
• Monitoring: overview of performance, uptime, and health of cloud services (metrics, events, metadata)
• Alerting: create policies to notify you when health and uptime check results exceed a certain limit
• Tracing: tracks how requests propagate through applications; provides near real-time performance results and latency reports of VMs
• Logging: store, search, monitor, and analyze log data and events from GCP
Storage

Persistent Disk
Fully-managed block storage (SSDs) that is suitable for VMs/containers. Good for snapshots of data backups and for sharing read-only data across VMs.

Cloud Storage
Infinitely scalable, fully-managed, and highly reliable object/blob storage. Good for data blobs: images, pictures, videos. Cannot query by content.
To use Cloud Storage, you create buckets to store data, and the location can be specified. Bucket names are globally unique. There are 4 storage classes:
• Multi-Regional: frequent access from anywhere in the world. Use for "hot data"
• Regional: high local performance for a region
• Nearline: storage for data accessed less than once a month (archival)
• Coldline: data accessed less than once a year (archival)
Can use Cloud Storage Transfer Service or Transfer Appliance to move data into Cloud Storage (from AWS, local, another bucket). Use gsutil if copying files over from on-premise.

CloudSQL and Cloud Spanner – Relational DBs

Cloud SQL
Fully-managed relational database service (supports MySQL/PostgreSQL). Use for relational data: tables, rows and columns, and highly structured data. SQL compatible and can update fields. Not horizontally scalable (relatively small storage, GBs). Good for web frameworks and OLTP workloads (not OLAP).

Cloud Spanner
Google-proprietary offering, more advanced than Cloud SQL. Mission-critical, relational database. Supports horizontal scaling. Combines the benefits of relational and non-relational databases.
• Ideal: relational, structured, and semi-structured data that requires high availability, strong consistency, and transactional reads and writes
• Avoid: data that is not relational or structured, when you want an open-source RDBMS, or when strong consistency and high availability are unnecessary

Cloud Spanner Data Model
A database can contain 1+ tables. Tables look like relational database tables. Data is strongly typed: you must define a schema for each database, and that schema must specify the data types of each column of each table.
Parent-Child Relationships: you can optionally define relationships between tables to physically co-locate their rows for efficient retrieval (data locality: physically storing 1+ rows of a table with a row from another table).

Other Storage Options
- Need SQL for OLTP: CloudSQL/Cloud Spanner.
- Need interactive querying for OLAP: BigQuery.
- Need to store blobs larger than 10 MB: Cloud Storage.
- Need to store structured objects in a document database, with ACID and SQL-like queries: Cloud Datastore.
BigTable
Columnar database ideal for applications that need high throughput, low latencies, and scalability (IoT, user analytics, time-series data, graph data) for non-structured key/value data (each value is < 10 MB). A single value in each row is indexed, and this value is known as the row key. Does not support SQL queries. Zonal service. Not good for datasets smaller than 1 TB or for items larger than 10 MB. Ideal for handling large amounts of data (TB/PB) over long periods of time.

Data Model
4-dimensional: row key, column family (table name), column name, timestamp.
• The row key uniquely identifies an entity, and columns contain individual values for each row.
• Similar columns are grouped into column families.
• Each column is identified by a combination of the column family and a column qualifier, which is a unique name within the column family.

Load Balancing
Automatically manages splitting, merging, and rebalancing. The master process balances workload/data volume within clusters. The master splits busier/larger tablets in half and merges less-accessed/smaller tablets together, redistributing them between nodes. The best write performance is achieved by using row keys that do not follow a predictable order and by grouping related rows so they are adjacent to one another, which results in more efficient multi-row reads.

Security
Security can be managed at the project and instance level. Does not support table-level, row-level, column-level, or cell-level security restrictions.

Designing Your Schema
Designing a Bigtable schema is different than designing a schema for an RDBMS. Important considerations:
• Each table has only one index, the row key (4 KB)
• Rows are sorted lexicographically by row key.
• All operations are atomic (ACID) at the row level.
• Both reads and writes should be distributed evenly
• Try to keep all info for an entity in a single row
• Related entities should be stored in adjacent rows
• Try to store at most 10 MB in a single cell (max 100 MB) and 100 MB in a single row (max 256 MB)
• Supports a max of 1,000 tables in each instance.
• Choose row keys that don't follow a predictable order
• Can use up to around 100 column families
• Column Qualifiers: you can create as many as you need in each row, but avoid splitting data across more column qualifiers than necessary (16 KB)
• Tables are sparse. Empty columns don't take up any space. You can create a large number of columns, even if most columns are empty in most rows.
• Field promotion (shift a column into the row key) and salting (prefix the key with the remainder of dividing a hash of the timestamp plus row key) are ways to help design row keys; see the sketch below.
For time-series data, use tall/narrow tables. Denormalize: prefer multiple tall and narrow tables.
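As a quick illustration of the row-key techniques above, here is a minimal Python sketch of field promotion and salting. The field names, the salt bucket count, and the helper functions are hypothetical and not part of any Bigtable API; they only show how such keys could be built.

import hashlib

NUM_SALT_BUCKETS = 8  # hypothetical number of salt prefixes

def promoted_row_key(device_id: str, reverse_ts: int) -> str:
    """Field promotion: move a frequently filtered column (device_id)
    into the row key so related rows sort next to each other."""
    return f"{device_id}#{reverse_ts}"

def salted_row_key(device_id: str, reverse_ts: int) -> str:
    """Salting: prefix the key with hash(key) % N so monotonically
    increasing timestamps spread across N tablet ranges (avoids hotspotting)."""
    base = f"{device_id}#{reverse_ts}"
    salt = int(hashlib.md5(base.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
    return f"{salt}#{base}"

# Reverse the timestamp so the newest rows sort first.
ts = 1_700_000_000
print(promoted_row_key("sensor-42", 2**63 - ts))
print(salted_row_key("sensor-42", 2**63 - ts))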
BigQuery
Scalable, fully-managed data warehouse with extremely fast SQL queries. Allows querying massive volumes of data at high speed. Good for OLAP workloads (petabyte-scale), Big Data exploration and processing, and reporting via Business Intelligence (BI) tools. Supports SQL querying for non-relational data. Relatively cheap to store, but costly for querying/processing. Good for analyzing historical data.

Data Model
Data tables are organized into units called datasets, which are sets of tables and views. A table must belong to a dataset, and a dataset must belong to a project. Tables contain records with rows and columns (fields). You can load data into BigQuery via two options: batch loading (free) and streaming (costly).

Security
BigQuery uses IAM to manage access to resources. The three types of resources in BigQuery are organizations, projects, and datasets. Security can be applied at the project and dataset level, but not at the table or view level.

Views
A view is a virtual table defined by a SQL query. When you create a view, you query it in the same way you query a table. Authorized views allow you to share query results with particular users/groups without giving them access to the underlying data. When a user queries the view, the query results contain data only from the tables and fields specified in the query that defines the view.

Billing
Billing is based on storage (amount of data stored), querying (amount of data/number of bytes processed by a query), and streaming inserts. Storage options are active and long-term (modified or not within the past 90 days). Query options are on-demand and flat-rate. Query costs are based on how much data you read/process, so if you only read a section of a table (partition), your costs are reduced. Any charges incurred are billed to the attached billing account. Exporting/importing/copying data is free.

BigQuery Part II

Partitioned Tables
Special tables that are divided into partitions based on a column or partition key. Data is stored in different directories, and specific queries run only on slices of the data, which improves query performance and reduces costs. Note that the partitions will not all be the same size; BigQuery handles this automatically. Each partitioned table can have up to 2,500 partitions (2,500 days, or a few years). The daily limit is 2,000 partition updates per table per day. The rate limit is 50 partition updates every 10 seconds.
Two types of partitioned tables:
• Ingestion Time: tables partitioned based on the data's ingestion (load) or arrival date. Each such table has a pseudo column _PARTITIONTIME, the time the data was loaded into the table. Pseudo columns are reserved for the table and cannot be used by the user.
• Partitioned Tables: tables that are partitioned based on a TIMESTAMP or DATE column.

Windowing
Window functions increase the efficiency and reduce the complexity of queries that analyze partitions (windows) of a dataset by providing complex operations without the need for many intermediate calculations. They reduce the need for intermediate tables to store temporary data.

Bucketing
Like partitioning, but each split/partition should be the same size and is based on the hash of a column. Each bucket is a separate file, which makes for more efficient sampling and joining of data.

Querying
After loading data into BigQuery, you can query it using Standard SQL (preferred) or Legacy SQL (old). Query jobs are actions executed asynchronously to load, export, query, or copy data. Results can be saved to permanent (stored) or temporary (cached) tables. 2 types of queries:
• Interactive: the query is executed immediately and counts toward daily/concurrent usage (default)
• Batch: batches of queries are queued, and each query starts when idle resources are available; it only counts toward the daily limit and switches to interactive if idle for 24 hours

Wildcard Tables
Used if you want to union all similar tables with similar names: '*' (e.g. project.dataset.Table*)

Optimizing BigQuery

Controlling Costs
- Avoid SELECT * (full scan); select only the columns needed (SELECT * EXCEPT)
- Sample data using preview options for free
- Preview queries to estimate costs (dry run)
- Use maximum bytes billed to limit query costs
- Don't use a LIMIT clause to limit costs (it is still a full scan)
- Monitor costs using dashboards and audit logs
- Partition data by date
- Break query results into stages
- Use default table expiration to delete unneeded data
- Use streaming inserts wisely
- Set a hard limit on bytes processed per day per member

Query Performance
Generally, queries that do less work perform better.
Input Data/Data Sources
- Avoid SELECT *
- Prune partitioned queries (for a time-partitioned table, use the _PARTITIONTIME pseudo column to filter partitions)
- Denormalize data (use nested and repeated fields)
- Use external data sources appropriately
- Avoid excessive wildcard tables
SQL Anti-Patterns
- Self-Join: avoid; use window functions instead (perform calculations across many table rows related to the current row)
- Partition/Skew: avoid unequally sized partitions, or cases where one value occurs far more often than any other value
- Cross-Join: avoid joins that generate more outputs than inputs (pre-aggregate data or use a window function)
- Update/Insert Single Row/Column: avoid point-specific DML; batch updates and inserts instead
Managing Query Outputs
- Avoid repeated joins and reusing the same subqueries
- Writing large result sets has performance/cost impacts; use filters or a LIMIT clause (128 MB limit for cached results)
- Use a LIMIT clause for large sorts (avoids "Resources Exceeded" errors)
Optimizing Query Computation
- Avoid repeatedly transforming data via SQL queries
- Avoid JavaScript user-defined functions
- Use approximate aggregation functions (e.g. APPROX_COUNT_DISTINCT)
- Order query operations to maximize performance: use ORDER BY only in the outermost query and push complex operations to the end of the query
- For queries that join data from multiple tables, optimize join patterns: start with the largest table
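A minimal sketch of the dry-run and byte-limit cost controls above, using the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()  # uses default project/credentials

sql = """
    SELECT user_id, event_name          -- select only the columns needed
    FROM `my_project.my_dataset.events`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-01-01')
                             AND TIMESTAMP('2024-01-07')  -- prune partitions
"""

# Dry run: estimate bytes processed without running the query (no charge).
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
estimate = client.query(sql, job_config=dry_cfg)
print(f"Estimated bytes processed: {estimate.total_bytes_processed}")

# Real run with a hard cap: the job fails instead of billing past the limit.
run_cfg = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)  # ~1 GB cap
for row in client.query(sql, job_config=run_cfg).result():
    print(row.user_id, row.event_name)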
DataStore
NoSQL document database that automatically handles sharding and replication (highly available, scalable, and durable). Supports ACID transactions and SQL-like queries. Query execution time depends on the size of the returned result, not the size of the dataset. Ideal for "needle in a haystack" operations and applications that rely on highly available structured data at scale.

Data Model
Data objects in Datastore are known as entities. An entity has one or more named properties, each of which can have one or more values. Each entity has a key that uniquely identifies it. You can fetch an individual entity using the entity's key, or query one or more entities based on the entities' keys or property values.

Ideal for highly structured data at scale: product catalogs, customer experience based on users' past activities/preferences, game states. Don't use it if you need extremely low latency or analytics (complex joins, etc.).

DataFlow
Managed service for developing and executing data processing patterns (ETL), based on Apache Beam, for streaming and batch data. Preferred for new Hadoop or Spark infrastructure development. Usually sits between front-end and back-end storage solutions.

Concepts
Pipeline: encapsulates a series of computations that accept input data from external sources, transform the data to provide some useful intelligence, and produce output.
PCollections: abstraction that represents a potentially distributed, multi-element data set that acts as the pipeline's data. PCollection objects represent input, intermediate, and output data. The edges of the pipeline.
Transforms: operations in the pipeline. A transform takes one or more PCollections as input, performs an operation that you specify on each element in that collection, and produces a new output PCollection. Uses the "what/where/when/how" model. The nodes of the pipeline. Composite transforms combine multiple transforms: combining, mapping, shuffling, reducing, or statistical analysis.
Pipeline I/O: the source/sink, where the data flows in and out. Supports read and write transforms for a number of common data storage types, as well as custom sources/sinks.

Windowing
Windowing a PCollection divides the elements into windows based on the associated event time of each element. Especially useful for PCollections of unbounded size, since it allows operating on sub-groups (mini-batches).

Triggers
Allow specifying when (in processing time) results for a given window can be produced. If unspecified, the default behavior is to trigger first when the watermark passes the end of the window, and then trigger again every time late data arrives.
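A minimal Apache Beam sketch of the concepts above (pipeline, PCollection, transforms, windowing). It runs locally with the DirectRunner; the element values and the 60-second window size are made up for illustration.

import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:  # DirectRunner by default; Dataflow via options
    (
        p
        # A bounded PCollection of (event_time_seconds, pageviews) events.
        | "Create" >> beam.Create([(10, 3), (20, 5), (70, 2), (75, 4)])
        # Attach event-time timestamps so windowing has something to act on.
        | "Stamp" >> beam.Map(lambda kv: window.TimestampedValue(kv[1], kv[0]))
        # Group elements into fixed 60-second event-time windows.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        # Aggregate per window.
        | "Sum" >> beam.CombineGlobally(sum).without_defaults()
        | "Print" >> beam.Map(print)
    )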
DataProc
Fully-managed cloud service for running Spark and Hadoop clusters. Provides access to a Hadoop cluster on GCP and Hadoop-ecosystem tools (Pig, Hive, and Spark). Can be used to implement an ETL warehouse solution.
Preferred if migrating existing on-premise Hadoop or Spark infrastructure to GCP without redevelopment effort. Dataflow is preferred for new development.

Pub/Sub
Asynchronous messaging service that decouples senders and receivers. Allows for secure and highly available communication between independently written applications. A publisher app creates and sends messages to a topic. Subscriber applications create a subscription to a topic to receive messages from it. Communication can be one-to-many (fan-out), many-to-one (fan-in), and many-to-many. Guarantees at-least-once delivery before deletion from the queue.

Scenarios
• Balancing workloads in network clusters: a queue can efficiently distribute tasks
• Implementing asynchronous workflows
• Data streaming from various processes or devices
• Reliability improvement: in case of zone failure
• Distributing event notifications
• Refreshing distributed caches
• Logging to multiple systems

Benefits/Features
Unified messaging, global presence, push- and pull-style subscriptions, replicated storage and guaranteed at-least-once message delivery, encryption of data at rest/in transit, easy-to-use REST/JSON API.

Data Model
Topic, Subscription, Message (combination of data and attributes that a publisher sends to a topic and that is eventually delivered to subscribers), Message Attribute (key-value pair that a publisher can define for a message).

Message Flow
• A publisher creates a topic in the Cloud Pub/Sub service and sends messages to the topic.
• Messages are persisted in a message store until they are delivered and acknowledged by subscribers.
• The Pub/Sub service forwards messages from a topic to all of its subscriptions individually. Each subscription receives messages either by push or pull.
• The subscriber receives pending messages from its subscription and acknowledges each message.
• When a message is acknowledged by the subscriber, it is removed from the subscription's message queue.
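A minimal publish/subscribe sketch of the message flow above, using the google-cloud-pubsub Python client. The project, topic, and subscription names are placeholders and are assumed to already exist.

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

project_id = "my-project"          # placeholder
topic_id = "shipment-events"       # placeholder, topic assumed to exist
subscription_id = "shipment-sub"   # placeholder, subscription assumed to exist

# Publisher: send a message with a payload and one attribute.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
future = publisher.publish(topic_path, b"parcel 42 scanned", region="eu-west")
print("Published message id:", future.result())

# Subscriber: pull messages and ack them so they leave the queue.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    print("Received:", message.data, dict(message.attributes))
    message.ack()  # acknowledged messages are removed from the subscription

streaming_pull = subscriber.subscribe(sub_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # listen for 30 seconds
except TimeoutError:
    streaming_pull.cancel()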
ML Engine
Managed infrastructure of GCP with the power and flexibility of TensorFlow. You can use it to train ML models at scale and host trained models to make predictions about new data in the cloud. Supported frameworks include TensorFlow, scikit-learn, and XGBoost.

ML Workflow
Evaluate Problem: What is the problem, and is ML the best approach? How will you measure the model's success?
Choosing Development Environment: supports Python 2 and 3, and supports TensorFlow, scikit-learn, and XGBoost as frameworks.
Data Preparation and Exploration: involves gathering, cleaning, transforming, exploring, splitting, and preprocessing data. Also includes feature engineering.
Model Training/Testing: provide access to train/test data and train the model in batches. Evaluate progress/results and adjust the model as needed. Export/save the trained model (250 MB or smaller to deploy in ML Engine).
Hyperparameter Tuning: hyperparameters are variables that govern the training process itself, not the training data. They are usually constant during training.
Prediction: host your trained ML models in the cloud and use the Cloud ML prediction service to infer target values for new data.

ML APIs
Speech-to-Text: speech-to-text conversion
Text-to-Speech: text-to-speech conversion
Translation: dynamically translate between languages
Vision: derive insight (objects/text) from images
Natural Language: extract information (sentiment, intent, entity, and syntax) about text: people, places, ...
Video Intelligence: extract metadata from videos

ML Concepts/Terminology
Features: input data used by the ML model
Feature Engineering: transforming input features to be more useful for the model, e.g. mapping categories to buckets, normalizing between -1 and 1, removing nulls
Train/Eval/Test: training data is used to optimize the model, evaluation data is used to assess the model on new data during training, and test data is used to provide the final result
Classification/Regression: regression is predicting a number (e.g. housing price); classification is predicting from a set of categories (e.g. predicting red/blue/green)
Linear Regression: predicts an output by multiplying and summing input features with weights and biases
Logistic Regression: similar to linear regression, but predicts a probability
Neural Network: composed of neurons (simple building blocks that actually "learn"); contains activation functions that make it possible to predict non-linear outputs
Activation Functions: mathematical functions that introduce non-linearity into a network, e.g. ReLU, tanh
Sigmoid Function: function that maps very negative numbers to a value very close to 0, huge numbers close to 1, and 0 to 0.5. Useful for predicting probabilities
Gradient Descent/Backpropagation: fundamental loss-optimizer algorithms on which the other optimizers are usually based. Backpropagation is similar to gradient descent but for neural nets
Optimizer: operation that changes the weights and biases to reduce loss, e.g. Adagrad or Adam
Weights / Biases: weights are values that the input features are multiplied by to predict an output value; biases are the value of the output given a weight of 0
Converge: an algorithm that converges will eventually reach an optimal answer, even if very slowly. An algorithm that doesn't converge may never reach an optimal answer
Learning Rate: rate at which optimizers change weights and biases. A high learning rate generally trains faster but risks not converging, whereas a lower rate trains slower
Overfitting: the model performs great on the training data but poorly on the test data (combat with dropout, early stopping, or reducing the number of nodes or layers)
Bias/Variance: how much the output is determined by the features. More variance often means overfitting; more bias can mean a bad model
Regularization: a variety of approaches to reduce overfitting, including adding the weights to the loss function and randomly dropping layers (dropout)
Ensemble Learning: training multiple models with different parameters to solve the same problem
Embeddings: mapping from discrete objects, such as words, to vectors of real numbers. Useful because classifiers/neural networks work well on vectors of real numbers
Cloud Datalab
Interactive tool (run on an instance) to explore, analyze, transform, and visualize data and build machine learning models. Built on Jupyter. Datalab is free but may incur costs based on usage of other services.

Data Studio
Turns your data into informative dashboards and reports. Updates to the data automatically update the dashboard. The query cache remembers the queries (requests for data) issued by the components in a report (lightning bolt); turn it off for data that changes frequently, when you want to prioritize freshness over performance, or when using a data source that incurs usage costs (e.g. BigQuery). The prefetch cache predicts data that could be requested by analyzing the dimensions, metrics, filters, and date-range properties and controls on the report.

TensorFlow
TensorFlow is an open-source software library for numerical computation using data-flow graphs. Everything in TF is a graph, where nodes represent operations on data and edges represent the data. Phase 1 of TF is building up a computation graph, and phase 2 is executing it. It is also distributed, meaning it can run on either a cluster of machines or a single machine.

Tensors
In a graph, tensors are the edges: multidimensional data arrays that flow through the graph. The central unit of data in TF, a tensor consists of a set of primitive values shaped into an array of any number of dimensions.
A tensor is characterized by its rank (number of dimensions), shape (number of dimensions and size of each dimension), and data type (type of each element in the tensor).

Placeholders and Variables
Variables: the best way to represent shared, persistent state manipulated by your program. These are the parameters of the ML model, altered/trained during the training process. Training variables.
Placeholders: a way to specify inputs into a graph; they hold the place for a Tensor that will be fed at runtime. They are assigned once and do not change afterwards. Input nodes.

Popular Architectures
Linear Classifier: takes input features and combines them with weights and biases to predict an output value
DNNClassifier: deep neural net; contains intermediate layers of nodes that represent "hidden features" and activation functions to represent non-linearity
ConvNets: convolutional neural nets, popular for image classification
Transfer Learning: use existing trained models as starting points and add additional layers for the specific use case. The idea is that highly trained existing models know general features that serve as a good starting point for training a small network on specific examples
RNN: recurrent neural nets, designed for handling a sequence of inputs, with "memory" of the sequence. LSTMs are a fancy version of RNNs, popular for NLP
GAN: generative adversarial network; one model creates fake examples, and another model is served both fake and real examples and is asked to distinguish between them
Wide and Deep: combines linear classifiers with deep neural net classifiers; the "wide" linear parts represent memorizing specific examples, and the "deep" parts represent understanding high-level features
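A minimal sketch of the two-phase graph model above, written against the TensorFlow 1.x API that this description assumes (tf.placeholder and tf.Session were removed in TF 2.x); the fed values are made up.

import tensorflow as tf  # assumes TensorFlow 1.x semantics

# Phase 1: build the computation graph.
x = tf.placeholder(tf.float32, shape=[None, 3], name="features")  # input node
w = tf.Variable(tf.zeros([3, 1]), name="weights")                 # trainable state
b = tf.Variable(tf.zeros([1]), name="bias")
y = tf.sigmoid(tf.matmul(x, w) + b)  # logistic-regression style output

# Phase 2: execute the graph in a session, feeding the placeholder.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[2.0, -1.0, 0.5]]}))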
Case Study: Flowlogistic I

Company Overview
Flowlogistic is a top logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.

Company Background
The company started as a regional trucking company and then expanded into other logistics markets. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.

Solution Concept
Flowlogistic wants to implement two concepts in the cloud:
• Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads.
• Perform analytics on all their orders and shipment logs (structured and unstructured data) to determine how best to deploy resources, which customers to target, and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.

Business Requirements
- Build a reliable and reproducible environment with scaled parity of production
- Aggregate data in a centralized Data Lake for analysis
- Perform predictive analytics on future shipments
- Accurately track every shipment worldwide
- Improve business agility and speed of innovation through rapid provisioning of new resources
- Analyze/optimize architecture cloud performance
- Migrate fully to the cloud if all other requirements are met

Technical Requirements
- Handle both streaming and batch data
- Migrate existing Hadoop workloads
- Ensure the architecture is scalable and elastic to meet the changing demands of the company
- Use managed services whenever possible
- Encrypt data in flight and at rest
- Connect a VPN between the production data center and the cloud environment

Case Study: Flowlogistic II

Existing Technical Environment
Flowlogistic's architecture resides in a single data center:
• Databases:
  – 8 physical servers in 2 clusters
    ∗ SQL Server: inventory, user/static data
  – 3 physical servers
    ∗ Cassandra: metadata, tracking messages
  – 10 Kafka servers: tracking message aggregation and batch insert
• Application servers: customer front end, middleware for orders/customs
  – 60 virtual machines across 20 physical servers
    ∗ Tomcat: Java services
    ∗ Nginx: static content
    ∗ Batch servers
• Storage appliances
  – iSCSI for virtual machine (VM) hosts
  – Fibre Channel storage area network (FC SAN): SQL Server storage
  – Network-attached storage (NAS): image storage, logs, backups
• 10 Apache Hadoop / Spark servers
  – Core Data Lake
  – Data analysis workloads
• 20 miscellaneous servers
  – Jenkins, monitoring, bastion hosts, security scanners, billing software

CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth/efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around. We need to organize our information so we can more easily understand where our customers are and what they are shipping.

CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.

CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments/deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.

Flowlogistic Potential Solution
1. Cloud Dataproc handles the existing workloads and produces results as before (using a VPN).
2. At the same time, data is received through the data center, via either stream or batch, and sent to Pub/Sub.
3. Pub/Sub encrypts the data in transit and at rest.
4. Data is fed into Dataflow either as stream or batch data.
5. Dataflow processes the data and sends the cleaned data to BigQuery (again, either as stream or batch).
6. Data can then be queried from BigQuery and predictive analysis can begin (using ML Engine, etc.).
Note: All services are fully managed, easily scalable, and can handle streaming/batch data. All technical requirements are met.
Case Study: MJTelco I

Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.

Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network, allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating many-to-many relationships between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.

Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
• Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
• Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definitions. MJTelco will also use three separate operating environments – development/test, staging, and production – to meet the needs of running experiments, deploying new features, and serving production customers.

Business Requirements
- Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
- Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
- Provide reliable and timely access to data for analysis from distributed research workers.
- Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.

Case Study: MJTelco II

Technical Requirements
- Ensure secure and efficient transport and storage of telemetry data.
- Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
- Allow analysis and presentation against data tables tracking up to 2 years of data, storing approximately 100m records/day.
- Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems, both in telemetry flows and in production learning cycles.

CEO Statement
Our business model relies on our patents, analytics, and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.

CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.

CFO Statement
This project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.

Choosing a Storage Option

Need                        | Open Source            | GCP Solution
Compute, Block Storage      | Persistent Disks, SSD  | Persistent Disks, SSD
Media, Blob Storage         | Filesystem, HDFS       | Cloud Storage
SQL Interface on File Data  | Hive                   | BigQuery
Document DB, NoSQL          | CouchDB, MongoDB       | DataStore
Fast Scanning NoSQL         | HBase, Cassandra       | BigTable
OLTP                        | RDBMS (MySQL)          | CloudSQL, Cloud Spanner
OLAP                        | Hive                   | BigQuery

Cloud Storage: unstructured data (blobs)
CloudSQL: OLTP, SQL, structured and relational data, no need for horizontal scaling
Cloud Spanner: OLTP, SQL, structured and relational data, need for horizontal scaling, between RDBMS and big data
Cloud Datastore: NoSQL, document data, key-value structured but non-relational data (XML, HTML); query time depends on the size of the result (not the dataset); fast to read, slow to write
BigTable: NoSQL, key-value data, columnar, good for sparse data, sensitive to hot spotting, high throughput and scalability for non-structured key/value data, where each value is typically no larger than 10 MB
Solutions

Data Lifecycle
At each stage, GCP offers multiple services to manage your data.
1. Ingest: the first stage is to pull in the raw data, such as streaming data from devices, on-premises batch data, application logs, or mobile-app user events and analytics
2. Store: after the data has been retrieved, it needs to be stored in a format that is durable and can be easily accessed
3. Process and Analyze: the data is transformed from raw form into actionable information
4. Explore and Visualize: convert the results of the analysis into a format that is easy to draw insights from and to share with colleagues and peers

Ingest
There are a number of approaches to collect raw data, based on the data's size, source, and latency.
• Application: data from application events, log files, or user events, typically collected in a push model, where the application calls an API to send the data to storage (Stackdriver Logging, Pub/Sub, CloudSQL, Datastore, Bigtable, Spanner)
• Streaming: data consists of a continuous stream of small, asynchronous messages. Common uses include telemetry, collecting data from geographically dispersed devices (IoT), and user events and analytics (Pub/Sub)
• Batch: large amounts of data are stored in a set of files that are transferred to storage in bulk. Common use cases include scientific workloads, backups, migration (Cloud Storage, Transfer Service, Transfer Appliance)

Storage
Cloud Storage: durable and highly-available object storage for structured and unstructured data.
Cloud SQL: fully managed cloud RDBMS that offers both MySQL and PostgreSQL engines with built-in support for replication, for low-latency, transactional, relational database workloads. Supports RDBMS workloads up to 10 TB (storing financial transactions, user credentials, customer orders).
BigTable: managed, high-performance NoSQL database service designed for terabyte- to petabyte-scale workloads. Suitable for large-scale, high-throughput workloads such as advertising technology or IoT data infrastructure. Does not support multi-row transactions, SQL queries, or joins; consider Cloud SQL or Cloud Datastore instead.
Cloud Spanner: fully managed relational database service for mission-critical OLTP applications. Horizontally scalable and built for strong consistency, high availability, and global scale. Supports schemas, ACID transactions, and SQL queries (use for retail and global supply chain, ad tech, financial services).
BigQuery: stores large quantities of data for query and analysis instead of transactional processing.

Process and Analyze
In order to derive business value and insights from data, you must transform and analyze it. This requires a processing framework that can either analyze the data directly or prepare the data for downstream analysis, as well as tools to analyze and understand the processing results.
• Processing: data from source systems is cleansed, normalized, and processed across multiple machines, and stored in analytical systems
• Analysis: processed data is stored in systems that allow for ad-hoc querying and exploration
• Understanding: based on analytical results, data is used to train and test automated machine-learning models

Processing
• Cloud Dataproc: migrate your existing Hadoop or Spark deployments to a fully-managed service that automates cluster creation, simplifies configuration and management of your cluster, has built-in monitoring and utilization reports, and can be shut down when not in use
• Cloud Dataflow: designed to simplify big data for both streaming and batch workloads; focuses on filtering, aggregating, and transforming your data
• Cloud Dataprep: service for visually exploring, cleaning, and preparing data for analysis. Can transform data of any size stored in CSV, JSON, or relational-table formats
For streaming data, use Pub/Sub and Dataflow in combination.

Analyzing and Querying
• BigQuery: query using SQL, all data encrypted; user analysis, device and operational metrics, business intelligence
• Task-Specific ML: Vision, Speech, Natural Language, Translation, Video Intelligence
• ML Engine: managed platform you can use to run custom machine learning models at scale

Exploration and Visualization
Data exploration and visualization help you better understand the results of the processing and analysis.
• Cloud Datalab: interactive web-based tool that you can use to explore, analyze, and visualize data, built on top of Jupyter notebooks. Runs on a VM, is automatically saved to persistent disks, and can be stored in a GC Source Repository (git repo)
• Data Studio: drag-and-drop report builder that you can use to visualize data in reports and dashboards that can then be shared with others, backed by live data, and updated easily. Data sources can be data files, Google Sheets, Cloud SQL, and BigQuery. Supports a query cache and a prefetch cache: the query cache remembers previous queries, and if the data is not found there, the prefetch cache is consulted, which predicts data that could be requested (disable if data changes frequently or if using a data source that incurs charges)
CME 106 – Introduction to Probability and Statistics for Engineers (Shervine Amidi, August 13, 2018) – https://stanford.edu/~shervine

Parameter estimation
Random sample – A random sample is a collection of n random variables $X_1, ..., X_n$ that are independent and identically distributed with X.
Estimator – An estimator $\hat{\theta}$ is a function of the data that is used to infer the value of an unknown parameter $\theta$ in a statistical model.
Bias – The bias of an estimator $\hat{\theta}$ is defined as being the difference between the expected value of the distribution of $\hat{\theta}$ and the true value, i.e.:
$$\textrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$
Remark: an estimator is said to be unbiased when we have $E[\hat{\theta}] = \theta$.
Sample mean and variance – The sample mean and the sample variance of a random sample are used to estimate the true mean $\mu$ and the true variance $\sigma^2$ of a distribution, are noted $\overline{X}$ and $s^2$ respectively, and are such that:
$$\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad\textrm{and}\quad s^2 = \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \overline{X})^2$$
Central Limit Theorem – Let us have a random sample $X_1, ..., X_n$ following a given distribution with mean $\mu$ and variance $\sigma^2$; then we have:
$$\overline{X} \underset{n\to+\infty}{\sim} N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$

Confidence intervals
Confidence level – A confidence interval $CI_{1-\alpha}$ with confidence level $1-\alpha$ of a true parameter $\theta$ is such that $1-\alpha$ of the time, the true value is contained in the confidence interval:
$$P(\theta \in CI_{1-\alpha}) = 1 - \alpha$$
Confidence interval for the mean – When determining a confidence interval for the mean $\mu$, different test statistics have to be computed depending on which case we are in. The following table sums it up:

Distribution of $X_i$ | Sample size | $\sigma$ | Statistic | $1-\alpha$ confidence interval
$X_i \sim N(\mu,\sigma)$ | any | known | $\frac{\overline{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$ | $\left[\overline{X} - z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}},\ \overline{X} + z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}\right]$
$X_i \sim N(\mu,\sigma)$ | small | unknown | $\frac{\overline{X}-\mu}{s/\sqrt{n}} \sim t_{n-1}$ | $\left[\overline{X} - t_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}},\ \overline{X} + t_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}\right]$
$X_i \sim$ any | large | known | $\frac{\overline{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)$ | $\left[\overline{X} - z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}},\ \overline{X} + z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}\right]$
$X_i \sim$ any | large | unknown | $\frac{\overline{X}-\mu}{s/\sqrt{n}} \sim N(0,1)$ | $\left[\overline{X} - z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}},\ \overline{X} + z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}\right]$
$X_i \sim$ any | small | any | Go home! | Go home!

Confidence interval for the variance – The single-line table below sums up the test statistic to compute when determining the confidence interval for the variance:

Distribution | Sample size | $\mu$ | Statistic | $1-\alpha$ confidence interval
$X_i \sim N(\mu,\sigma)$ | any | any | $\frac{s^2(n-1)}{\sigma^2} \sim \chi^2_{n-1}$ | $\left[\frac{s^2(n-1)}{\chi^2_{\frac{\alpha}{2}}},\ \frac{s^2(n-1)}{\chi^2_{1-\frac{\alpha}{2}}}\right]$

Hypothesis testing
Errors – In a hypothesis test, we note $\alpha$ and $\beta$ the type I and type II errors respectively. By noting T the test statistic and R the rejection region, we have:
$$\alpha = P(T \in R \mid H_0 \textrm{ true}) \quad\textrm{and}\quad \beta = P(T \notin R \mid H_1 \textrm{ true})$$
p-value – In a hypothesis test, the p-value is the probability under the null hypothesis of having a test statistic T at least as extreme as the one that we observed, $T_0$. In particular, when T follows a zero-centered symmetric distribution under $H_0$, we have:

Case | Left-sided | Right-sided | Two-sided
p-value | $P(T \leq T_0 \mid H_0 \textrm{ true})$ | $P(T \geq T_0 \mid H_0 \textrm{ true})$ | $P(|T| \geq |T_0| \mid H_0 \textrm{ true})$

Sign test – The sign test is a non-parametric test used to determine whether the median of a sample is equal to the hypothesized median. By noting V the number of samples falling to the right of the hypothesized median, we have:
Statistic when $np < 5$: $V \underset{H_0}{\sim} B\!\left(n, p = \tfrac{1}{2}\right)$
Statistic when $np > 5$: $Z = \dfrac{V - \frac{n}{2}}{\frac{\sqrt{n}}{2}} \underset{H_0}{\sim} N(0,1)$

Testing for the difference in two means – The table below sums up the test statistic to compute when performing a hypothesis test where the null hypothesis is $H_0: \mu_X - \mu_Y = \delta$:

Distribution of $X_i, Y_i$ | $n_X, n_Y$ | $\sigma_X^2, \sigma_Y^2$ | Statistic
any | any | known | $\dfrac{(\overline{X}-\overline{Y}) - \delta}{\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}} \underset{H_0}{\sim} N(0,1)$
Normal | large | unknown | $\dfrac{(\overline{X}-\overline{Y}) - \delta}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}} \underset{H_0}{\sim} N(0,1)$
Normal | small | unknown, $\sigma_X = \sigma_Y$ | $\dfrac{(\overline{X}-\overline{Y}) - \delta}{s\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}} \underset{H_0}{\sim} t_{n_X+n_Y-2}$
Normal, paired ($D_i = X_i - Y_i$, $n_X = n_Y$) | any | unknown | $\dfrac{\overline{D} - \delta}{\frac{s_D}{\sqrt{n}}} \underset{H_0}{\sim} t_{n-1}$

$\chi^2$ goodness of fit test – By noting k the number of bins, n the total number of samples, $p_i$ the probability of success in each bin, and $Y_i$ the associated number of samples, we can use the test statistic T defined below to test whether or not there is a good fit. If $np_i > 5$, we have:
$$T = \sum_{i=1}^{k} \frac{(Y_i - np_i)^2}{np_i} \underset{H_0}{\sim} \chi^2_{df} \quad\textrm{with}\quad df = (k-1) - \#(\textrm{estimated parameters})$$
Test for arbitrary trends – Given a sequence, the test for arbitrary trends is a non-parametric test whose aim is to determine whether the data suggest the presence of an increasing trend:
$$H_0: \textrm{no trend} \quad\textrm{versus}\quad H_1: \textrm{there is an increasing trend}$$
If we note x the number of transpositions in the sequence, the p-value is computed as:
$$\textrm{p-value} = P(T \leq x)$$

Regression analysis
Sum of squared errors – By keeping the same notations, we define the sum of squared errors, also known as SSE, as follows:
$$\textrm{SSE} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n}\bigl(Y_i - (A + Bx_i)\bigr)^2 = S_{YY} - BS_{XY}$$
Least-squares estimates – When estimating the coefficients $\alpha, \beta$ with the least-squares method, which is done by minimizing the SSE, we obtain the estimates A, B defined as follows:
$$A = \overline{Y} - \frac{S_{XY}}{S_{XX}}\overline{x} \quad\textrm{and}\quad B = \frac{S_{XY}}{S_{XX}}$$
Key results – When $\sigma$ is unknown, this parameter is estimated by the unbiased estimator $s^2$ defined as follows:
$$s^2 = \frac{S_{YY} - BS_{XY}}{n-2} \quad\textrm{and we have}\quad \frac{s^2(n-2)}{\sigma^2} \sim \chi^2_{n-2}$$
The table below sums up the properties surrounding the least-squares estimates A, B when $\sigma$ is known or not:

Coeff | $\sigma$ | Statistic | $1-\alpha$ confidence interval
$\alpha$ | known | $\dfrac{A - \alpha}{\sigma\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{S_{XX}}}} \sim N(0,1)$ | $\left[A - z_{\frac{\alpha}{2}}\sigma\sqrt{\tfrac{1}{n} + \tfrac{\overline{x}^2}{S_{XX}}},\ A + z_{\frac{\alpha}{2}}\sigma\sqrt{\tfrac{1}{n} + \tfrac{\overline{x}^2}{S_{XX}}}\right]$
$\alpha$ | unknown | $\dfrac{A - \alpha}{s\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{S_{XX}}}} \sim t_{n-2}$ | $\left[A - t_{\frac{\alpha}{2}}s\sqrt{\tfrac{1}{n} + \tfrac{\overline{x}^2}{S_{XX}}},\ A + t_{\frac{\alpha}{2}}s\sqrt{\tfrac{1}{n} + \tfrac{\overline{x}^2}{S_{XX}}}\right]$
$\beta$ | known | $\dfrac{B - \beta}{\frac{\sigma}{\sqrt{S_{XX}}}} \sim N(0,1)$ | $\left[B - z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{S_{XX}}},\ B + z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{S_{XX}}}\right]$
$\beta$ | unknown | $\dfrac{B - \beta}{\frac{s}{\sqrt{S_{XX}}}} \sim t_{n-2}$ | $\left[B - t_{\frac{\alpha}{2}}\frac{s}{\sqrt{S_{XX}}},\ B + t_{\frac{\alpha}{2}}\frac{s}{\sqrt{S_{XX}}}\right]$
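A small Python check of the t-based confidence-interval row above (normal data, small n, σ unknown); the sample values are made up.

import numpy as np
from scipy import stats

x = np.array([4.8, 5.1, 5.6, 4.9, 5.3, 5.0])  # made-up small sample
n = len(x)
xbar = x.mean()
s = x.std(ddof=1)            # sample standard deviation (n-1 denominator)
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

half_width = t_crit * s / np.sqrt(n)
print(f"{1-alpha:.0%} CI for the mean: "
      f"[{xbar - half_width:.3f}, {xbar + half_width:.3f}]")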
Distance Metrics
The parameter k of the Minkowski distance provides a way to trade off between the largest and the total dimensional difference. In other words, larger values of k place more emphasis on large differences between feature values than smaller values. Selecting the right k can significantly impact the meaningfulness of your distance function. The most popular values are 1 and 2.
• Manhattan (k=1): city block distance, or the sum of the absolute differences between two points
• Euclidean (k=2): straight-line distance
Weighted Minkowski: $d_k(a,b) = \sqrt[k]{\sum_{i=1}^{d} w_i|a_i - b_i|^k}$; in some scenarios, not all dimensions are equal, and we can convey this idea using $w_i$. Generally not a good idea: you should normalize data by Z-scores before computing distances.
Cosine Similarity: $\cos(a,b) = \frac{a \cdot b}{|a||b|}$ calculates the similarity between 2 non-zero vectors, where $a \cdot b$ is the dot product (normalized between 0 and 1); higher values imply more similar vectors.
Kullback-Leibler Divergence: $KL(A\|B) = \sum_{i=1}^{d} a_i \log_2 \frac{a_i}{b_i}$. KL divergence measures the distance between probability distributions by measuring the uncertainty gained or lost when replacing distribution A with distribution B. However, this is not a metric, but it forms the basis for the Jensen-Shannon divergence metric.
Jensen-Shannon: $JS(A,B) = \frac{1}{2}KL(A\|M) + \frac{1}{2}KL(M\|B)$, where M is the average of A and B. The JS function is the right metric for calculating distances between probability distributions.

KNN Algorithm
1. Compute the distance D(a,b) from point b to all points
2. Select the k closest points and their labels
3. Output the class with the most frequent label among the k points

Optimizing KNN
Comparing a query point a in d dimensions against n training examples has a runtime of O(nd), which can cause lag as the number of points reaches millions or billions. Popular choices to speed up KNN include:
• Voronoi Diagrams: partition the plane into regions based on the distance to points in a specific subset of the plane
• Grid Indexes: carve up space into d-dimensional boxes or grids and calculate the NN in the same cell as the point
• Locality Sensitive Hashing (LSH): abandons the idea of finding the exact nearest neighbors. Instead, batch up nearby points to quickly find the most appropriate bucket B for our query point. LSH is defined by a hash function h(p) that takes a point/vector as input and produces a number/code as output, such that it is likely that h(a) = h(b) if a and b are close to each other, and h(a) != h(b) if they are far apart.
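A compact NumPy sketch of the three KNN steps above (Euclidean distance, brute-force O(nd) scan); the training points and labels are made up.

import numpy as np
from collections import Counter

def knn_predict(query, X_train, y_train, k=3):
    # 1. Compute the distance from the query point to all training points.
    dists = np.linalg.norm(X_train - query, axis=1)   # Euclidean (k=2)
    # 2. Select the k closest points and their labels.
    nearest = np.argsort(dists)[:k]
    # 3. Output the most frequent label among those k points.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
y_train = np.array(["blue", "blue", "red", "red"])
print(knn_predict(np.array([4.8, 5.0]), X_train, y_train, k=3))  # -> "red"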
K-Means Clustering
1. Choose a K. Randomly assign a number between 1 and K to each observation. These serve as the initial cluster assignments.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using the distance metric).
Since the results of the algorithm depend on the initial random assignments, it is a good idea to repeat the algorithm from different random initializations to obtain the best overall result. Can use MSE to determine which cluster assignment is better. (A short sketch follows the clustering notes below.)

Hierarchical Clustering
Alternative clustering algorithm that does not require us to commit to a particular K. Another advantage is that it results in a nice visualization called a dendrogram. Observations that fuse at the bottom are similar, whereas those that fuse at the top are quite different; we draw conclusions based on the location on the vertical rather than the horizontal axis.
1. Begin with n observations and a measure of all the n(n−1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
2. For i = n, n−1, ..., 2:
(a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram where the fusion should be placed.
(b) Compute the new pairwise inter-cluster dissimilarities among the remaining i−1 clusters.
Linkage: Complete (max dissimilarity), Single (min), Average, Centroid (between centroids of clusters A and B).
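A minimal NumPy sketch of the K-means loop described above (random initial assignments, recompute centroids, reassign until stable); the data and K are made up, and empty clusters are not handled.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of K clusters.
    labels = rng.integers(0, K, size=len(X))
    centroids = None
    for _ in range(n_iter):
        # Step 2a: compute each cluster centroid (mean of its members).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2b: reassign each observation to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centroids = kmeans(X, K=2)
print(centroids)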
Machine Learning Part I

Comparing ML Algorithms
Power and Expressibility: ML methods differ in terms of complexity. Linear regression fits linear functions, while NNs define piecewise-linear separation boundaries. More complex models can provide more accurate fits, but at the risk of overfitting.
Interpretability: some models are more transparent and understandable than others (white-box vs. black-box models).
Ease of Use: some models feature few parameters/decisions (linear regression/NN), while others require more decision making to optimize (SVMs).
Training Speed: models differ in how fast they fit the necessary parameters.
Prediction Speed: models differ in how fast they make predictions given a query.

Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features.

Problem: Suppose we need to classify vector $X = x_1...x_n$ into m classes, $C_1...C_m$. We need to compute the probability of each possible class given X, so we can assign X the label of the class with the highest probability. We can calculate a probability using Bayes' Theorem:
$$P(C_i|X) = \frac{P(X|C_i)P(C_i)}{P(X)}$$
Where:
1. $P(C_i)$: the prior probability of belonging to class i
2. $P(X)$: the normalizing constant, or the probability of seeing the given input vector over all possible input vectors
3. $P(X|C_i)$: the conditional probability of seeing input vector X given we know the class is $C_i$
The prediction model will formally look like:
$$C(X) = \underset{i \in \textrm{classes}(t)}{\textrm{argmax}}\ \frac{P(X|C_i)P(C_i)}{P(X)}$$
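A quick scikit-learn sketch of a Gaussian Naive Bayes classifier on made-up data; the library choice and the toy features are illustrative, not part of the notes above.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy training data: two features, two classes (0 and 1).
X = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 3.7]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB()            # assumes each feature is Gaussian within a class
clf.fit(X, y)                 # estimates P(C_i) and per-class feature stats

query = np.array([[2.9, 3.9]])
print(clf.predict(query))        # argmax_i P(X|C_i)P(C_i)
print(clf.predict_proba(query))  # normalized posteriors P(C_i|X)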
Machine Learning Part II

Decision Trees
Binary branching structure used to classify an arbitrary input vector X. Each node in the tree contains a simple feature comparison against some field ($x_i > 42$?). The result of each comparison is either true or false, which determines whether we proceed to the left or right child of the given node. Also sometimes called classification and regression trees (CART).
Advantages: non-linearity, support for categorical variables, easy to interpret, applicable to regression.
Disadvantages: prone to overfitting, unstable (not robust to noise), high variance, low bias.
Note: rarely do models use just one decision tree. Instead, we aggregate many decision trees using methods like ensembling, bagging, and boosting.

Ensembles, Bagging, Random Forests, Boosting
Ensemble learning is the strategy of combining many different classifiers/models into one predictive model. It revolves around the idea of voting: a so-called "wisdom of crowds" approach. The most-predicted class will be the final prediction.
Bagging: ensemble method that works by taking B bootstrapped subsamples of the training data and constructing B trees, with each tree training on a distinct subsample.
Random Forests: builds on bagging by decorrelating the trees. We do everything the same as in bagging, but when we build the trees, every time we consider a split, a random sample of the p predictors is chosen as split candidates, not the full set (typically $m \approx \sqrt{p}$). When m = p, we are just doing bagging.
Boosting: the main idea is to improve our model where it is not performing well by using information from previously constructed classifiers. A slow learner. Has 3 tuning parameters: the number of classifiers B, the learning parameter λ, and the interaction depth d (controls the interaction order of the model).

Machine Learning Part III

Support Vector Machines
Work by constructing a hyperplane that separates points between two classes. The hyperplane is determined using the maximal margin hyperplane, which is the hyperplane at the maximum distance from the training observations. This distance is called the margin. Points that fall on one side of the hyperplane are classified as -1 and the others as +1.

Principal Component Analysis (PCA)
Principal components allow us to summarize a set of correlated variables with a smaller set of variables that collectively explain most of the variability in the original set. Essentially, we are "dropping" the least important feature variables.
Principal Component Analysis is the process by which principal components are calculated and then used to analyze and understand the data. PCA is an unsupervised approach and is used for dimensionality reduction, feature extraction, and data visualization. Variables after performing PCA are independent. Scaling variables is also important while performing PCA.
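A short scikit-learn sketch of PCA with the scaling step the notes call out; the dataset shape and number of components are made up.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)            # 100 samples, 5 features (made up)

# Scale first: PCA is sensitive to the variance (units) of each feature.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)             # keep the 2 strongest components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # share of variance each PC explains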
CREATE
Create a new local repository
$ git init

LOCAL CHANGES
Changed files in your working directory
$ git status
Changes to tracked files
$ git diff
Add all current changes to the next commit
$ git add .
Add some changes in <file> to the next commit
$ git add -p <file>
Commit all local changes in tracked files
$ git commit -a
Commit previously staged changes
$ git commit
Change the last commit (don't amend published commits!)
$ git commit --amend

COMMIT HISTORY
Show all commits, starting with newest
$ git log
Show changes over time for a specific file
$ git log -p <file>
Who changed what and when in <file>
$ git blame <file>

BRANCHES & TAGS
Switch HEAD branch
$ git checkout <branch>
Create a new branch based on your current HEAD
$ git branch <new-branch>
Create a new tracking branch based on a remote branch
$ git checkout --track <remote/branch>
Delete a local branch
$ git branch -d <branch>
Mark the current commit with a tag
$ git tag <tag-name>

UPDATE & PUBLISH
List all currently configured remotes
$ git remote -v
Show information about a remote
$ git remote show <remote>
Add new remote repository, named <remote>
$ git remote add <shortname> <url>
Download all changes from <remote>, but don't integrate into HEAD
$ git fetch <remote>
Download changes and directly merge/integrate into HEAD
$ git pull <remote> <branch>
Publish local changes on a remote
$ git push <remote> <branch>
Delete a branch on the remote
$ git branch -dr <remote/branch>
Publish your tags
$ git push --tags

MERGE & REBASE
Rebase your current HEAD onto <branch> (don't rebase published commits!)
$ git rebase <branch>
Abort a rebase
$ git rebase --abort
Continue a rebase after resolving conflicts
$ git rebase --continue
Use your configured merge tool to solve conflicts
$ git mergetool
Use your editor to manually solve conflicts and (after resolving) mark the file as resolved
$ git add <resolved-file>
$ git rm <resolved-file>

UNDO
Discard all local changes in your working directory
$ git reset --hard HEAD
Discard local changes in a specific file
$ git checkout HEAD <file>
Revert a commit (by producing a new commit with contrary changes)
$ git revert <commit>
Reset your HEAD pointer to a previous commit
…and discard all changes since then
$ git reset --hard <commit>
…and preserve all changes as unstaged changes
$ git reset <commit>
…and preserve uncommitted local changes
$ git reset --keep <commit>

VERSION CONTROL BEST PRACTICES
COMMIT RELATED CHANGES
A commit should be a wrapper for related changes. For example, fixing two different bugs should produce two separate commits. Small commits make it easier for other developers to understand the changes and roll them back if something went wrong. With tools like the staging area and the ability to stage only parts of a file, Git makes it easy to create very granular commits.
TEST CODE BEFORE YOU COMMIT
Resist the temptation to commit something that you «think» is completed. Test it thoroughly to make sure it really is completed and has no side effects (as far as one can tell). While committing half-baked things in your local repository only requires you to forgive yourself, having your code tested is even more important when it comes to pushing/sharing your code with others.
USE BRANCHES
Branching is one of Git's most powerful features – and this is not by accident: quick and easy branching was a central requirement from day one. Branches are the perfect tool to help you avoid mixing up different lines of development. You should use branches extensively in your development workflows: for new features, bug fixes, ideas…
General

Definition
We want to learn a target function f that maps input variables X to output variable Y, with an error e:
Y = f(X) + e

Underfitting refers to poor inductive learning from training data and poor generalization.
Overfitting refers to learning the training data detail and noise, which leads to poor generalization. It can be limited by using resampling and by defining a validation dataset.

Linear, Nonlinear
Different algorithms make different assumptions about the shape and structure of f, hence the need to test several methods. Any algorithm can be either:
- Parametric (or Linear): simplify the mapping to a known linear combination form and learn its coefficients.
- Non-parametric (or Nonlinear): free to learn any functional form from the training data, while maintaining some ability to generalize.
Linear algorithms are usually simpler, faster and require less data, while Nonlinear ones can be more flexible, more powerful and more performant.

Supervised, Unsupervised
Supervised learning methods learn to predict Y from X given that the data is labeled.

Optimization
Almost every machine learning method has an optimization algorithm at its core.

Gradient Descent
Gradient Descent is used to find the coefficients of f that minimize a cost function (for example MSE, SSR). Procedure (see the sketch below):
→ Initialization θ = 0 (coefficients set to 0 or random)
→ Calculate cost J(θ) = evaluate(f(coefficients))
→ Gradient of cost ∂J(θ)/∂θj, which gives the uphill direction
→ Update coefficients θj = θj − α ∂J(θ)/∂θj, i.e. go downhill
The cost updating process is repeated until convergence (minimum found).
- Average over 10 or more updates to observe the learning trend while using SGD
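A minimal NumPy sketch of this procedure for linear regression with an MSE cost; the data, learning rate and iteration count are assumptions made for illustration, not part of the original sheet:

import numpy as np

# Toy data: y = 1 + 2x plus noise
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 1, 100)]   # design matrix with intercept column
y = 1 + 2 * X[:, 1] + rng.normal(0, 0.1, 100)

theta = np.zeros(2)        # initialization: coefficients to 0
alpha = 0.1                # learning rate (assumed)
for _ in range(2000):      # repeat until convergence (fixed number of steps here)
    error = X @ theta - y              # residuals; J(theta) would be (error**2).mean()
    grad = 2 * X.T @ error / len(y)    # gradient of the MSE cost (uphill direction)
    theta = theta - alpha * grad       # go downhill
print(theta)   # approximately [1, 2]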
Maximum Likelihood Estimation
MLE is used to find the estimators that maximize the likelihood function, the density function of the data distribution:
L(θ | x) = Πi fθ(xi)

Ordinary Least Squares
OLS is used to find the estimator β that minimizes the sum of squared residuals:
Σ(i=1..n) (yi − β0 − Σ(j=1..p) βj xij)² = ||y − Xβ||²
Using linear algebra, the closed-form solution is β = (XᵀX)⁻¹ Xᵀ y.
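A short NumPy sketch of the closed-form OLS solution above; the toy data is made up for illustration:

import numpy as np

rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]   # design matrix with intercept
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + rng.normal(0, 0.1, 50)

# beta = (X^T X)^(-1) X^T y, computed by solving the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to [0.5, 2.0, -1.0]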
Linear Algorithms
All linear algorithms assume a linear relationship between the input variables X and the output variable Y.

Linear Regression
Learning:
Learning a Linear Regression means estimating the coefficients from the training data. Common methods include Gradient Descent or Ordinary Least Squares.
Variations:
There are extensions of Linear Regression training called regularization methods, that aim to reduce the complexity of the model:
- Lasso Regression: OLS is modified to also minimize the absolute sum of the coefficients (L1 regularization):
Σ(i=1..n) (yi − β0 − Σ(j=1..p) βj xij)² + λ Σ(j=1..p) |βj| = RSS + λ Σ(j=1..p) |βj|
where λ ≥ 0 is a tuning parameter to be determined.
Data preparation:
- Transform data for linear relationship (ex: log transform for exponential relationship)
- Remove noise such as outliers
- Rescale inputs using standardization or normalization
Advantages:
+ Good regression baseline considering simplicity
+ Lasso/Ridge can be used to avoid overfitting
+ Lasso/Ridge permit feature selection in case of collinearity
Usecase examples:
- Product sales prediction according to prices or promotions
- Call-center waiting-time prediction according to the number of complaints and the number of working agents

Logistic Regression
It is the go-to method for binary classification.
Representation:
Logistic regression is a linear method but predictions are transformed using the logistic function (or sigmoid) φ. φ is S-shaped and maps any real-valued number into (0,1). The representation is an equation with binary output (a sketch follows at the end of this entry):
y = e^(β0 + β1x1 + … + βpxp) / (1 + e^(β0 + β1x1 + … + βpxp))
Learning:
Learning the Logistic Regression coefficients is done using maximum-likelihood estimation, to predict values close to 1 for the default class and close to 0 for the other class.
Data preparation:
- Probability transformation to binary for classification
- Remove noise such as outliers
Advantages:
+ Good classification baseline considering simplicity
+ Possibility to change cutoff for precision/recall tradeoff
+ Robust to noise/overfitting with L1/L2 regularization
+ Probability output can be used for ranking
Usecase examples:
- Customer scoring with probability of purchase
- Classification of loan defaults according to profile
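A minimal sketch of the sigmoid representation above (NumPy; the coefficient values are made up for illustration):

import numpy as np

def predict_proba(x, beta):
    # Logistic representation: sigmoid of the linear combination beta0 + beta·x
    z = beta[0] + np.dot(beta[1:], x)
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([-1.0, 0.8, 1.5])                    # illustrative coefficients
print(predict_proba(np.array([0.5, 1.0]), beta))     # probability of the default class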
Linear Discriminant Analysis
For multiclass classification, LDA is the preferred linear technique.
Representation:
LDA representation consists of statistical properties calculated for each class: the means and the covariance matrix:
μk = (1/nk) Σ(i=1..nk) xi and σ² = (1/(n − K)) Σ(i=1..n) (xi − μk)²
LDA assumes Gaussian data and attributes of the same σ². Predictions are made using Bayes Theorem:
Dk(x) = x · (μk / σ²) − μk² / (2σ²) + ln(P(k))
The class with the largest discriminant value is the output class (see the sketch after this entry).
Variations:
- Quadratic DA: each class uses its own variance estimate
- Regularized DA: introduces regularization into the variance estimate
Data preparation:
- Review and modify univariate distributions to be Gaussian
- Standardize data to μ = 0, σ = 1 to have the same variance
- Remove noise such as outliers
Advantages:
+ Can be used for dimensionality reduction by keeping the latent variables as new variables
Usecase example:
- Prediction of customer churn
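A hedged scikit-learn sketch of LDA on simulated data (the dataset and parameters are assumptions, not from the original sheet):

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=5, n_classes=3,
                           n_informative=3, random_state=0)
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict(X[:5]))           # class with the largest discriminant value
print(lda.transform(X[:5]).shape)   # projection usable for dimensionality reduction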
Nonlinear Algorithms
All Nonlinear Algorithms are non-parametric and more flexible. They are not sensitive to outliers and do not require any particular shape of distribution.

Classification and Regression Trees
Also referred to as CART or Decision Trees, this algorithm is the foundation of Random Forest and Boosted Trees.
Representation:
The model representation is a binary tree, where each node is an input variable x with a split point and each leaf contains an output variable y for prediction. The model actually splits the input space into (hyper) rectangles, and predictions are made according to the area observations fall into.
Learning:
Learning of a CART is done by a greedy approach called recursive binary splitting of the input space: at each step, the best predictor Xj and the best cutpoint s are selected such that {X | Xj < s} and {X | Xj ≥ s} minimize the cost.
- For regression the cost is the Sum of Squared Error:
Σ(i=1..n) (yi − ŷ)²
- For classification the cost function is the Gini index:
G = Σk pk (1 − pk)
The Gini index is an indication of how pure the leaves are: if all observations are of the same type, G = 0 (perfect purity), while a 50-50 split for binary classes gives G = 0.5 (worst purity); a sketch of this computation follows this entry.
The most common stopping criterion for splitting is a minimum number of training observations per node.
The simplest form of pruning is Reduced Error Pruning: starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected, the change is kept.
Advantages:
+ Easy to interpret and no overfitting with pruning
+ Works for both regression and classification problems
+ Can take any type of variables without modifications, and do not require any data preparation
Usecase examples:
- Fraudulent transaction classification
- Predict human resource allocation in companies
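A small pure-Python sketch of the Gini computation described above (the example labels are made up):

from collections import Counter

def gini(labels):
    # G = sum over classes of p_k * (1 - p_k) for the observations in a leaf
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini(['a', 'a', 'a', 'a']))   # 0.0 -> perfect purity
print(gini(['a', 'a', 'b', 'b']))   # 0.5 -> worst purity for a binary split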
Naive Bayes Classifier
Naive Bayes is a classification algorithm interested in selecting the best hypothesis h given data d, assuming there is no interaction between features.
Representation:
The representation is based on Bayes Theorem:
P(h|d) = P(d|h) × P(h) / P(d)
with the naïve hypothesis P(h|d) = P(x1|h) × … × P(xi|h).
The prediction is the Maximum A Posteriori hypothesis:
MAP(h) = max P(h|d) = max (P(d|h) × P(h))
The denominator is not kept as it is only for normalization.
Learning:
Training is fast because only probabilities need to be calculated (a counting sketch follows this entry):
P(h) = instances(h) / all instances and P(x|h) = count(x ∧ h) / instances(h)
Variations:
Gaussian Naive Bayes can extend to numerical attributes by assuming a Gaussian distribution. Instead, P(x|h) is calculated with P(h) during learning from:
μ(x) = (1/n) Σi xi and σ = (1/n) Σi (xi − μ(x))²
and the MAP for prediction is calculated using the Gaussian PDF:
f(x | μ(x), σ) = (1 / √(2πσ)) e^(−(x−μ)² / (2σ²))
Data preparation:
- Change numerical inputs to categorical (binning) or near-Gaussian inputs (remove outliers, log & boxcox transform)
- Other distributions can be used instead of Gaussian
- Log-transform of the probabilities can avoid overflow
- Probabilities can be updated as data becomes available
Advantages:
+ Fast because of the calculations
+ If the naive assumption works, can converge quicker than other models; can be used on smaller training data
+ Good for few-categories variables
Usecase examples:
- Article classification using binary word presence
- Email spam detection using a similar technique
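A toy sketch of the counting-based learning step above; the word/label data is hypothetical and only illustrates the two probability formulas:

from collections import Counter

# Hypothetical binary word-presence data: (set of words, label)
data = [({'offer', 'win'}, 'spam'), ({'meeting'}, 'ham'),
        ({'win', 'prize'}, 'spam'), ({'meeting', 'notes'}, 'ham')]

label_counts = Counter(label for _, label in data)
p_h = {h: c / len(data) for h, c in label_counts.items()}       # P(h) = instances(h) / all instances

def p_x_given_h(word, h):
    # P(x|h) = count(x and h) / instances(h)
    return sum(1 for feats, label in data if label == h and word in feats) / label_counts[h]

print(p_h['spam'], p_x_given_h('win', 'spam'))   # 0.5, 1.0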
K-Nearest Neighbors
If you are similar to your neighbors, you are one of them.
Representation:
KNN uses the entire training set, no training is required. Predictions are made by searching the k most similar instances, according to a distance, and summarizing the output. For regression the output can be the mean, while for classification the output can be the most common class.
Various distances can be used, for example (a prediction sketch follows this entry):
- Euclidean Distance, good for similar types of variables:
d(a, b) = √( Σi (ai − bi)² )
- Manhattan Distance, good for different types of variables:
d(a, b) = Σi |ai − bi|
The best value of k must be found by testing, and the algorithm is sensitive to the Curse of Dimensionality.
Data preparation:
- Rescale inputs using standardization or normalization
- Address missing data for distance calculations
- Dimensionality reduction or feature selection for COD
Advantages:
+ Effective if the training data is large
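A minimal sketch of KNN prediction with Euclidean distance (NumPy; the training points are invented for illustration):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Find the k nearest training points (Euclidean) and return the most common class
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(['blue', 'blue', 'blue', 'red', 'red', 'red'])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))   # 'blue'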
Support Vector Machines
The prediction function is the signed distance of the new input x to the separating hyperplane w:
f(x) = ⟨w, x⟩ + ρ = wᵀx + ρ, with ρ the bias
which gives, for a linear kernel, with xi the support vectors:
f(x) = Σi ai × (x × xi) + ρ
Learning:
The hyperplane learning is done by transforming the problem using linear algebra and minimizing:
(1/n) Σ(i=1..n) max(0, 1 − yi (w · xi − b)) + λ||w||²

Ensemble Algorithms
Ensemble methods use multiple, simpler algorithms combined to obtain better performance.

Bagging and Random Forest
Random Forest is part of a bigger type of ensemble methods called Bootstrap Aggregation or Bagging. Bagging can reduce the variance of high-variance models. It uses the Bootstrap statistical procedure: estimate a quantity from a sample by creating many random subsamples with replacement, and computing the mean of each subsample.
Representation:
For bagged decision trees, the steps would be (a sketch follows below):
- Create many subsamples of the training dataset
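The list above is cut off in the source; as a hedged illustration of the bagged-trees idea it names (bootstrap subsamples, one tree per subsample, aggregated predictions), a short sketch assuming scikit-learn:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):                                   # many bootstrap subsamples
    idx = rng.integers(0, len(X), len(X))             # sample with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Aggregate: average the individual tree predictions
pred = np.mean([t.predict(X[:3]) for t in trees], axis=0)
print(pred)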
make_pizza()
make_pizza('pepperoni')

Returning a value
def add_numbers(x, y):
    """Add two numbers and return the sum."""
    return x + y

sum = add_numbers(3, 5)
print(sum)

If you had infinite programming skills, what would you build? As you're learning to program, it's helpful to think about the real-world projects you'd like to create. It's a good habit to keep an "ideas" notebook that you can refer to whenever you want to start a new project. If you haven't done so already, take a few minutes and describe three projects you'd like to create.

Simple is better than complex
If you have a choice between a simple and a complex solution, and both work, use the simple solution. Your code will be easier to maintain, and it will be easier for you and others to build on that code later on.
A list stores a series of items in a particular order. Lists allow you to store sets of information in one place, whether you have just a few items or millions of items. Lists are one of Python's most powerful features readily accessible to new programmers, and they tie together many important concepts in programming.

Use square brackets to define a list, and use commas to separate individual items in the list. Use plural names for lists, to make your code easier to read.
users = ['val', 'bob', 'mia', 'ron', 'ned']

Individual elements in a list are accessed according to their position, called the index. The index of the first element is 0, the index of the second element is 1, and so forth. Negative indices refer to items at the end of the list. To get a particular element, write the name of the list and then the index of the element in square brackets.
Getting the first element
first_user = users[0]
Getting the second element
second_user = users[1]
Getting the last element
newest_user = users[-1]

You can add elements to the end of a list, or you can insert them wherever you like in a list.
Adding an element to the end of the list
users.append('amy')
Starting with an empty list
users = []
users.append('val')
users.append('bob')
users.append('mia')
Inserting elements at a particular position
users.insert(0, 'joe')
users.insert(3, 'bea')

You can remove elements by their position in a list, or by the value of the item. If you remove an item by its value, Python removes only the first item that has that value.
del users[-1]
Removing an item by its value
users.remove('mia')

If you want to work with an element that you're removing from the list, you can "pop" the element. If you think of the list as a stack of items, pop() takes an item off the top of the stack. By default pop() returns the last element in the list, but you can also pop elements from any position in the list.
Pop the last item from a list
most_recent_user = users.pop()
print(most_recent_user)
Pop the first item in a list
first_user = users.pop(0)
print(first_user)

The sort() method changes the order of a list permanently. The sorted() function returns a copy of the list, leaving the original list unchanged. You can sort the items in a list in alphabetical order, or reverse alphabetical order. You can also reverse the original order of a list. Keep in mind that lowercase and uppercase letters may affect the sort order.
Sorting a list permanently
users.sort()
Sorting a list permanently in reverse alphabetical order
users.sort(reverse=True)
Sorting a list temporarily
print(sorted(users))
print(sorted(users, reverse=True))
Reversing the order of a list
users.reverse()

Lists can contain millions of items, so Python provides an efficient way to loop through all the items in a list. When you set up a loop, Python pulls each item from the list one at a time and stores it in a temporary variable, which you provide a name for. This name should be the singular version of the list name. The indented block of code makes up the body of the loop, where you can work with each individual item. Any lines that are not indented run after the loop is completed.
Printing all items in a list
for user in users:
    print(user)
Printing a message for each item, and a separate message afterwards
for user in users:
    print("Welcome, " + user + "!")
print("Welcome, we're glad to see you all!")
Use curly braces to define a dictionary. Use colons to Looping through all the keys
connect keys and values, and use commas to separate You can modify the value associated with any key in a # Show everyone who's taken the survey.
individual key-value pairs. dictionary. To do so give the name of the dictionary and for name in fav_languages.keys():
enclose the key in square brackets, then provide the new print(name)
Making a dictionary
value for that key.
alien_0 = {'color': 'green', 'points': 5} Looping through all the values
Modifying values in a dictionary
# Show all the languages that have been chosen.
alien_0 = {'color': 'green', 'points': 5} for language in fav_languages.values():
print(alien_0) print(language)
To access the value associated with an individual key give
the name of the dictionary and then place the key in a set of # Change the alien's color and point value. Looping through all the keys in order
square brackets. If the key you're asking for is not in the alien_0['color'] = 'yellow'
dictionary, an error will occur. # Show each person's favorite language,
alien_0['points'] = 10 # in order by the person's name.
You can also use the get() method, which returns None print(alien_0)
instead of an error if the key doesn't exist. You can also for name in sorted(fav_languages.keys()):
specify a default value to use if the key is not in the print(name + ": " + language)
dictionary.
Getting the value associated with a key You can remove any key-value pair you want from a
dictionary. To do so use the del keyword and the dictionary
You can find the number of key-value pairs in a dictionary.
alien_0 = {'color': 'green', 'points': 5} name, followed by the key in square brackets. This will
delete the key and its associated value. Finding a dictionary's length
print(alien_0['color'])
print(alien_0['points']) Deleting a key-value pair num_responses = len(fav_languages)
alien_0 = {'color': 'green', 'points': 5}
Getting the value with get()
print(alien_0)
alien_0 = {'color': 'green'}
del alien_0['points']
alien_color = alien_0.get('color') print(alien_0)
alien_points = alien_0.get('points', 0)
print(alien_color)
print(alien_points) Try running some of these examples on pythontutor.com.
It's sometimes useful to store a set of dictionaries in a list; Storing a list inside a dictionary allows you to associate
this is called nesting. more than one value with each key. in which keys and values are added; they only preserve the
association between each key and its value. If you want to
Storing dictionaries in a list Storing lists in a dictionary preserve the order in which keys and values are added, use
# Start with an empty list. # Store multiple languages for each person. an OrderedDict.
users = [] fav_languages = { Preserving the order of keys and values
'jen': ['python', 'ruby'],
# Make a new user, and add them to the list. 'sarah': ['c'], from collections import OrderedDict
new_user = { 'edward': ['ruby', 'go'],
'last': 'fermi', 'phil': ['python', 'haskell'], # Store each person's languages, keeping
'first': 'enrico', # track of who responded first.
'username': 'efermi', fav_languages = OrderedDict()
} # Show all responses for each person.
users.append(new_user) for name, langs in fav_languages.items(): fav_languages['jen'] = ['python', 'ruby']
print(name + ": ") fav_languages['sarah'] = ['c']
# Make another new user, and add them as well. for lang in langs: fav_languages['edward'] = ['ruby', 'go']
new_user = { print("- " + lang) fav_languages['phil'] = ['python', 'haskell']
'last': 'curie',
'first': 'marie', # Display the results, in the same order they
'username': 'mcurie', # were entered.
} You can store a dictionary inside another dictionary. In this for name, langs in fav_languages.items():
users.append(new_user) case each value associated with a key is itself a dictionary. print(name + ":")
for lang in langs:
Storing dictionaries in a dictionary print("- " + lang)
# Show all information about each user.
for user_dict in users: users = {
for k, v in user_dict.items(): 'aeinstein': {
print(k + ": " + v) 'first': 'albert',
'last': 'einstein', You can use a loop to generate a large number of
print("\n")
'location': 'princeton', dictionaries efficiently, if all the dictionaries start out with
You can also define a list of dictionaries directly, }, similar data.
without using append(): 'mcurie': { A million aliens
# Define a list of users, where each user 'first': 'marie',
'last': 'curie', aliens = []
# is represented by a dictionary.
users = [ 'location': 'paris',
}, # Make a million green aliens, worth 5 points
{ # each. Have them all start in one row.
'last': 'fermi', }
for alien_num in range(1000000):
'first': 'enrico', new_alien = {}
'username': 'efermi', for username, user_dict in users.items():
print("\nUsername: " + username) new_alien['color'] = 'green'
}, new_alien['points'] = 5
{ full_name = user_dict['first'] + " "
full_name += user_dict['last'] new_alien['x'] = 20 * alien_num
'last': 'curie', new_alien['y'] = 0
'first': 'marie', location = user_dict['location']
aliens.append(new_alien)
'username': 'mcurie',
}, print("\tFull name: " + full_name.title())
print("\tLocation: " + location.title()) # Prove the list contains a million aliens.
] num_aliens = len(aliens)
# Show all information about each user. print("Number of aliens created:")
for user_dict in users: Nesting is extremely useful in certain situations. However, print(num_aliens)
for k, v in user_dict.items(): be aware of making your code overly complex. If you're
print(k + ": " + v) nesting items much deeper than what you see here there
print("\n") are probably simpler ways of managing your data, such as More cheat sheets available at
using classes.
Testing numerical values is similar to testing string values. Several kinds of if statements exist. Your choice of which to
use depends on the number of conditions you need to test.
Testing equality and inequality You can have as many elif blocks as you need, and the
>>> age = 18 else block is always optional.
>>> age == 18 Simple if statement
True
>>> age != 18 age = 19
False
if age >= 18:
Comparison operators print("You're old enough to vote!")
>>> age = 19 If-else statements
>>> age < 21
True age = 17
>>> age <= 21
True if age >= 18:
If statements allow you to examine the current state >>> age > 21 print("You're old enough to vote!")
of a program and respond appropriately to that state. False else:
>>> age >= 21 print("You can't vote yet.")
You can write a simple if statement that checks one
False
condition, or you can create a complex series of if The if-elif-else chain
statements that identify the exact conditions you're
age = 12
looking for.
You can check multiple conditions at the same time. The if age < 4:
While loops run as long as certain conditions remain and operator returns True if all the conditions listed are price = 0
true. You can use while loops to let your programs True. The or operator returns True if any condition is True. elif age < 18:
run as long as your users want them to. Using and to check multiple conditions price = 5
else:
>>> age_0 = 22 price = 10
>>> age_1 = 18
A conditional test is an expression that can be evaluated as >>> age_0 >= 21 and age_1 >= 21 print("Your cost is $" + str(price) + ".")
True or False. Python uses the values True and False to False
decide whether the code in an if statement should be >>> age_1 = 23
executed. >>> age_0 >= 21 and age_1 >= 21
True You can easily test whether a certain value is in a list. You
Checking for equality
A single equal sign assigns a value to a variable. A double equal can also test whether a list is empty before trying to loop
Using or to check multiple conditions through the list.
sign (==) checks whether two values are equal.
>>> age_0 = 22 Testing if a value is in a list
>>> car = 'bmw'
>>> age_1 = 18
>>> car == 'bmw' >>> players = ['al', 'bea', 'cyn', 'dale']
>>> age_0 >= 21 or age_1 >= 21
True >>> 'al' in players
True
>>> car = 'audi' True
>>> age_0 = 18
>>> car == 'bmw' >>> 'eric' in players
>>> age_0 >= 21 or age_1 >= 21
False False
False
Ignoring case when making a comparison
>>> car = 'Audi'
>>> car.lower() == 'audi' A boolean value is either True or False. Variables with
True boolean values are often used to keep track of certain
conditions within a program.
Checking for inequality
Simple boolean values
>>> topping = 'mushrooms'
>>> topping != 'anchovies' game_active = True
True can_edit = False
Testing if a value is not in a list Letting the user choose when to quit Using continue in a loop
banned_users = ['ann', 'chad', 'dee'] prompt = "\nTell me something, and I'll " banned_users = ['eve', 'fred', 'gary', 'helen']
user = 'erin' prompt += "repeat it back to you."
prompt += "\nEnter 'quit' to end the program. " prompt = "\nAdd a player to your team."
if user not in banned_users: prompt += "\nEnter 'quit' when you're done. "
print("You can play!") message = ""
while message != 'quit': players = []
Checking if a list is empty message = input(prompt) while True:
players = [] player = input(prompt)
if message != 'quit': if player == 'quit':
if players: print(message) break
for player in players: elif player in banned_users:
Using a flag print(player + " is banned!")
print("Player: " + player.title())
else: prompt = "\nTell me something, and I'll " continue
print("We have no players yet!") prompt += "repeat it back to you." else:
prompt += "\nEnter 'quit' to end the program. " players.append(player)
Simple input
if message == 'quit':
name = input("What's your name? ") active = False Every while loop needs a way to stop running so it won't
print("Hello, " + name + ".") else: continue to run forever. If there's no way for the condition to
print(message) become False, the loop will never stop running.
Accepting numerical input
Using break to exit a loop An infinite loop
age = input("How old are you? ")
age = int(age) prompt = "\nWhat cities have you visited?" while True:
prompt += "\nEnter 'quit' when you're done. " name = input("\nWho are you? ")
if age >= 18: print("Nice to meet you, " + name + "!")
print("\nYou can vote!") while True:
else: city = input(prompt)
print("\nYou can't vote yet.")
if city == 'quit': The remove() method removes a specific value from a list,
Accepting input in Python 2.7 break
Use raw_input() in Python 2.7. This function interprets all input as a
but it only removes the first instance of the value you
string, just as input() does in Python 3.
else: provide. You can use a while loop to remove all instances
print("I've been to " + city + "!") of a particular value.
name = raw_input("What's your name? ")
print("Hello, " + name + ".") Removing all cats from a list of pets
pets = ['dog', 'cat', 'dog', 'fish', 'cat',
Sublime Text doesn't run programs that prompt the user for
'rabbit', 'cat']
input. You can use Sublime Text to write programs that
prompt for input, but you'll need to run these programs from print(pets)
A while loop repeats a block of code as long as a condition
is True. a terminal.
while 'cat' in pets:
Counting to 5 pets.remove('cat')
current_number = 1 print(pets)
You can use the break statement and the continue
statement with any of Python's loops. For example you can
while current_number <= 5: use break to quit a for loop that's working through a list or a
print(current_number) dictionary. You can use continue to skip over certain items
current_number += 1 when looping through a list or dictionary as well.
The two main kinds of arguments are positional and A function can return a value or a set of values. When a
keyword arguments. When you use positional arguments function returns a value, the calling line must provide a
Python matches the first argument in the function call with variable in which to store the return value. A function stops
the first parameter in the function definition, and so forth. running when it reaches a return statement.
With keyword arguments, you specify which parameter
each argument should be assigned to in the function call. Returning a single value
When you use keyword arguments, the order of the def get_full_name(first, last):
arguments doesn't matter. """Return a neatly formatted full name."""
Using positional arguments full_name = first + ' ' + last
return full_name.title()
def describe_pet(animal, name):
Functions are named blocks of code designed to do """Display information about a pet.""" musician = get_full_name('jimi', 'hendrix')
print("\nI have a " + animal + ".") print(musician)
one specific job. Functions allow you to write code
print("Its name is " + name + ".")
once that can then be run whenever you need to Returning a dictionary
accomplish the same task. Functions can take in the describe_pet('hamster', 'harry') def build_person(first, last):
information they need, and return the information they describe_pet('dog', 'willie') """Return a dictionary of information
generate. Using functions effectively makes your about a person.
programs easier to write, read, test, and fix. Using keyword arguments
"""
def describe_pet(animal, name): person = {'first': first, 'last': last}
"""Display information about a pet.""" return person
The first line of a function is its definition, marked by the print("\nI have a " + animal + ".")
keyword def. The name of the function is followed by a set print("Its name is " + name + ".") musician = build_person('jimi', 'hendrix')
of parentheses and a colon. A docstring, in triple quotes, print(musician)
describes what the function does. The body of a function is describe_pet(animal='hamster', name='harry')
describe_pet(name='willie', animal='dog') Returning a dictionary with optional values
indented one level.
To call a function, give the name of the function followed def build_person(first, last, age=None):
by a set of parentheses. """Return a dictionary of information
about a person.
Making a function You can provide a default value for a parameter. When
"""
function calls omit this argument the default value will be
def greet_user(): used. Parameters with default values must be listed after person = {'first': first, 'last': last}
"""Display a simple greeting.""" parameters without default values in the function's definition if age:
print("Hello!") so positional arguments can still work correctly. person['age'] = age
return person
greet_user() Using a default value
musician = build_person('jimi', 'hendrix', 27)
def describe_pet(name, animal='dog'):
print(musician)
"""Display information about a pet."""
Information that's passed to a function is called an print("\nI have a " + animal + ".")
musician = build_person('janis', 'joplin')
argument; information that's received by a function is called print("Its name is " + name + ".")
print(musician)
a parameter. Arguments are included in parentheses after
the function's name, and parameters are listed in describe_pet('harry', 'hamster')
parentheses in the function's definition. describe_pet('willie')
Try running some of these examples on pythontutor.com.
Passing a single argument Using None to make an argument optional
def greet_user(username): def describe_pet(animal, name=None):
"""Display a simple greeting.""" """Display information about a pet."""
print("Hello, " + username + "!") print("\nI have a " + animal + ".")
if name:
print("Its name is " + name + ".")
greet_user('jesse')
greet_user('diana')
greet_user('brandon') describe_pet('hamster', 'harry')
describe_pet('snake')
You can pass a list as an argument to a function, and the Sometimes you won't know how many arguments a You can store your functions in a separate file called a
function can work with the values in the list. Any changes function will need to accept. Python allows you to collect an module, and then import the functions you need into the file
the function makes to the list will affect the original list. You arbitrary number of arguments into one parameter using the containing your main program. This allows for cleaner
can prevent a function from modifying a list by passing a * operator. A parameter that accepts an arbitrary number of program files. (Make sure your module is stored in the
copy of the list as an argument. arguments must come last in the function definition. same directory as your main program.)
The ** operator allows a parameter to collect an arbitrary
Passing a list as an argument number of keyword arguments. Storing a function in a module
File: pizza.py
def greet_users(names): Collecting an arbitrary number of arguments
"""Print a simple greeting to everyone.""" def make_pizza(size, *toppings):
for name in names: def make_pizza(size, *toppings): """Make a pizza."""
msg = "Hello, " + name + "!" """Make a pizza.""" print("\nMaking a " + size + " pizza.")
print(msg) print("\nMaking a " + size + " pizza.") print("Toppings:")
print("Toppings:") for topping in toppings:
usernames = ['hannah', 'ty', 'margot'] for topping in toppings: print("- " + topping)
greet_users(usernames) print("- " + topping)
Importing an entire module
Allowing a function to modify a list File: making_pizzas.py
# Make three pizzas with different toppings. Every function in the module is available in the program file.
The following example sends a list of models to a function for
make_pizza('small', 'pepperoni')
printing. The original list is emptied, and the second list is filled. import pizza
make_pizza('large', 'bacon bits', 'pineapple')
def print_models(unprinted, printed): make_pizza('medium', 'mushrooms', 'peppers',
"""3d print a set of models.""" 'onions', 'extra cheese') pizza.make_pizza('medium', 'pepperoni')
while unprinted: pizza.make_pizza('small', 'bacon', 'pineapple')
current_model = unprinted.pop() Collecting an arbitrary number of keyword arguments
Importing a specific function
print("Printing " + current_model) def build_profile(first, last, **user_info): Only the imported functions are available in the program file.
printed.append(current_model) """Build a user's profile dictionary."""
# Build a dict with the required keys. from pizza import make_pizza
# Store some unprinted designs, profile = {'first': first, 'last': last}
# and print each of them. make_pizza('medium', 'pepperoni')
unprinted = ['phone case', 'pendant', 'ring'] # Add any other keys and values. make_pizza('small', 'bacon', 'pineapple')
printed = [] for key, value in user_info.items():
print_models(unprinted, printed) Giving a module an alias
profile[key] = value
import pizza as p
print("\nUnprinted:", unprinted) return profile
print("Printed:", printed) p.make_pizza('medium', 'pepperoni')
# Create two users with different kinds p.make_pizza('small', 'bacon', 'pineapple')
Preventing a function from modifying a list
The following example is the same as the previous one, except the # of information.
user_0 = build_profile('albert', 'einstein',
Giving a function an alias
original list is unchanged after calling print_models().
location='princeton') from pizza import make_pizza as mp
def print_models(unprinted, printed): user_1 = build_profile('marie', 'curie',
"""3d print a set of models.""" location='paris', field='chemistry') mp('medium', 'pepperoni')
while unprinted: mp('small', 'bacon', 'pineapple')
current_model = unprinted.pop() print(user_0)
print("Printing " + current_model) print(user_1) Importing all functions from a module
printed.append(current_model) Don't do this, but recognize it when you see it in others' code. It
can result in naming conflicts, which can cause errors.
# Store some unprinted designs, from pizza import *
# and print each of them. As you can see there are many ways to write and call a
original = ['phone case', 'pendant', 'ring'] function. When you're starting out, aim for something that make_pizza('medium', 'pepperoni')
printed = [] simply works. As you gain experience you'll develop an make_pizza('small', 'bacon', 'pineapple')
understanding of the more subtle advantages of different
print_models(original[:], printed) structures such as positional and keyword arguments, and
print("\nOriginal:", original) the various approaches to importing functions. For now if More cheat sheets available at
print("Printed:", printed) your functions do what you need them to, you're doing well.
If the class you're writing is a specialized version of another
Creating an object from a class class, you can use inheritance. When one class inherits
my_car = Car('audi', 'a4', 2016) from another, it automatically takes on all the attributes and
methods of the parent class. The child class is free to
Accessing attribute values introduce new attributes and methods, and override
attributes and methods of the parent class.
print(my_car.make)
To inherit from another class include the name of the
print(my_car.model)
parent class in parentheses when defining the new class.
print(my_car.year)
Classes are the foundation of object-oriented The __init__() method for a child class
Calling methods
programming. Classes represent real-world things
class ElectricCar(Car):
you want to model in your programs: for example my_car.fill_tank()
"""A simple model of an electric car."""
dogs, cars, and robots. You use a class to make my_car.drive()
objects, which are specific instances of dogs, cars, Creating multiple objects def __init__(self, make, model, year):
and robots. A class defines the general behavior that """Initialize an electric car."""
a whole category of objects can have, and the my_car = Car('audi', 'a4', 2016) super().__init__(make, model, year)
information that can be associated with those objects. my_old_car = Car('subaru', 'outback', 2013)
my_truck = Car('toyota', 'tacoma', 2010)
Classes can inherit from each other – you can # Attributes specific to electric cars.
write a class that extends the functionality of an # Battery capacity in kWh.
existing class. This allows you to code efficiently for a self.battery_size = 70
# Charge level in %.
wide variety of situations. You can modify an attribute's value directly, or you can
self.charge_level = 0
write methods that manage updating values more carefully.
Modifying an attribute directly Adding new methods to the child class
Consider how we might model a car. What information class ElectricCar(Car):
my_new_car = Car('audi', 'a4', 2016)
would we associate with a car, and what behavior would it --snip--
my_new_car.fuel_level = 5
have? The information is stored in variables called def charge(self):
attributes, and the behavior is represented by functions. Writing a method to update an attribute's value """Fully charge the vehicle."""
Functions that are part of a class are called methods. self.charge_level = 100
def update_fuel_level(self, new_level): print("The vehicle is fully charged.")
The Car class """Update the fuel level."""
class Car():
if new_level <= self.fuel_capacity: Using child methods and parent methods
self.fuel_level = new_level
"""A simple attempt to model a car.""" my_ecar = ElectricCar('tesla', 'model s', 2016)
else:
print("The tank can't hold that much!")
def __init__(self, make, model, year): my_ecar.charge()
"""Initialize car attributes.""" Writing a method to increment an attribute's value my_ecar.drive()
self.make = make
self.model = model def add_fuel(self, amount):
self.year = year """Add fuel to the tank."""
if (self.fuel_level + amount
# Fuel capacity and level in gallons. <= self.fuel_capacity): There are many ways to model real world objects and
self.fuel_capacity = 15 self.fuel_level += amount situations in code, and sometimes that variety can feel
self.fuel_level = 0 print("Added fuel.") overwhelming. Pick an approach and try it – if your first
else: attempt doesn't work, try a different approach.
def fill_tank(self): print("The tank won't hold that much.")
"""Fill gas tank to capacity."""
self.fuel_level = self.fuel_capacity
print("Fuel tank is full.")
In Python class names are written in CamelCase and object
def drive(self): names are written in lowercase with underscores. Modules
"""Simulate driving.""" that contain classes should still be named in lowercase with
print("The car is moving.") underscores.
Class files can get long as you add detailed information and
Overriding parent methods functionality. To help keep your program files uncluttered, Classes should inherit from object
class ElectricCar(Car): you can store your classes in modules and import the class ClassName(object):
--snip-- classes you need into your main program.
def fill_tank(self): The Car class in Python 2.7
Storing classes in a file
"""Display an error message.""" car.py class Car(object):
print("This car has no fuel tank!")
"""Represent gas and electric cars.""" Child class __init__() method is different
class ChildClassName(ParentClass):
class Car():
def __init__(self):
A class can have objects as attributes. This allows classes """A simple attempt to model a car."""
super(ClassName, self).__init__()
to work together to model complex situations. --snip--
The ElectricCar class in Python 2.7
A Battery class class Battery():
"""A battery for an electric car.""" class ElectricCar(Car):
class Battery(): def __init__(self, make, model, year):
"""A battery for an electric car.""" --snip--
super(ElectricCar, self).__init__(
class ElectricCar(Car): make, model, year)
def __init__(self, size=70):
"""Initialize battery attributes.""" """A simple model of an electric car."""
# Capacity in kWh, charge level in %. --snip--
self.size = size Importing individual classes from a module A list can hold as many items as you want, so you can
self.charge_level = 0 my_cars.py make a large number of objects from a class and store
them in a list.
def get_range(self): from car import Car, ElectricCar Here's an example showing how to make a fleet of rental
"""Return the battery's range.""" cars, and make sure all the cars are ready to drive.
if self.size == 70: my_beetle = Car('volkswagen', 'beetle', 2016)
my_beetle.fill_tank() A fleet of rental cars
return 240
elif self.size == 85: my_beetle.drive() from car import Car, ElectricCar
return 270
my_tesla = ElectricCar('tesla', 'model s', # Make lists to hold a fleet of cars.
Using an instance as an attribute 2016) gas_fleet = []
my_tesla.charge() electric_fleet = []
class ElectricCar(Car):
my_tesla.drive()
--snip--
Importing an entire module # Make 500 gas cars and 250 electric cars.
def __init__(self, make, model, year): for _ in range(500):
"""Initialize an electric car.""" import car car = Car('ford', 'focus', 2016)
super().__init__(make, model, year) gas_fleet.append(car)
my_beetle = car.Car( for _ in range(250):
# Attribute specific to electric cars. 'volkswagen', 'beetle', 2016) ecar = ElectricCar('nissan', 'leaf', 2016)
self.battery = Battery() my_beetle.fill_tank() electric_fleet.append(ecar)
my_beetle.drive()
def charge(self): # Fill the gas cars, and charge electric cars.
"""Fully charge the vehicle.""" my_tesla = car.ElectricCar( for car in gas_fleet:
self.battery.charge_level = 100 'tesla', 'model s', 2016) car.fill_tank()
print("The vehicle is fully charged.") my_tesla.charge() for ecar in electric_fleet:
my_tesla.drive() ecar.charge()
Using the instance
Importing all classes from a module
my_ecar = ElectricCar('tesla', 'model x', 2016) (Don’t do this, but recognize it when you see it.) print("Gas cars:", len(gas_fleet))
print("Electric cars:", len(electric_fleet))
from car import *
my_ecar.charge()
print(my_ecar.battery.get_range())
my_beetle = Car('volkswagen', 'beetle', 2016)
my_ecar.drive()
Storing the lines in a list Opening a file using an absolute path
filename = 'siddhartha.txt' f_path = "/home/ehmatthes/books/alice.txt"
import unittest E
from full_names import get_full_name ================================================
ERROR: test_first_last (__main__.NamesTestCase)
class NamesTestCase(unittest.TestCase): Test names like Janis Joplin.
"""Tests for names.py.""" ------------------------------------------------
Traceback (most recent call last):
def test_first_last(self): File "test_full_names.py", line 10,
When you write a function or a class, you can also """Test names like Janis Joplin.""" in test_first_last
write tests for that code. Testing proves that your full_name = get_full_name('janis', 'joplin')
code works as it's supposed to in the situations it's 'joplin') TypeError: get_full_name() missing 1 required
designed to handle, and also when people use your self.assertEqual(full_name, positional argument: 'last'
programs in unexpected ways. Writing tests gives 'Janis Joplin')
you confidence that your code will work correctly as ------------------------------------------------
more people begin to use your programs. You can unittest.main() Ran 1 test in 0.001s
also add new features to your programs and know Running the test
that you haven't broken existing behavior. FAILED (errors=1)
Python reports on each unit test in the test case. The dot reports a
single passing test. Python informs us that it ran 1 test in less than Fixing the code
A unit test verifies that one specific aspect of your 0.001 seconds, and the OK lets us know that all unit tests in the When a test fails, the code needs to be modified until the test
test case passed. passes again. (Don’t make the mistake of rewriting your tests to fit
code works as it's supposed to. A test case is a
your new code.) Here we can make the middle name optional.
collection of unit tests which verify your code's .
behavior in a wide variety of situations. --------------------------------------- def get_full_name(first, last, middle=''):
Ran 1 test in 0.000s """Return a full name."""
if middle:
OK full_name = "{0} {1} {2}".format(first,
Python's unittest module provides tools for testing your middle, last)
code. To try it out, we’ll create a function that returns a full else:
name. We’ll use the function in a regular program, and then full_name = "{0} {1}".format(first,
build a test case for the function. Failing tests are important; they tell you that a change in the
last)
code has affected existing behavior. When a test fails, you
A function to test return full_name.title()
need to modify the code so the existing behavior still works.
Save this as full_names.py
Modifying the function Running the test
def get_full_name(first, last): Now the test should pass again, which means our original
We’ll modify get_full_name() so it handles middle names, but
"""Return a full name.""" functionality is still intact.
we’ll do it in a way that breaks existing behavior.
full_name = "{0} {1}".format(first, last) .
return full_name.title() def get_full_name(first, middle, last):
---------------------------------------
"""Return a full name."""
Ran 1 test in 0.000s
Using the function full_name = "{0} {1} {2}".format(first,
Save this as names.py middle, last)
OK
return full_name.title()
from full_names import get_full_name
Using the function
janis = get_full_name('janis', 'joplin')
print(janis) from full_names import get_full_name
x_values = [0, 1, 2, 3, 4, 5]
squares = [0, 1, 4, 9, 16, 25]
plt.plot(x_values, squares)
plt.show()
You can make as many plots as you want on one figure. You can include as many individual graphs in one figure as
When you make multiple plots, you can emphasize Datetime formatting arguments you want. This is useful, for example, when comparing
The strftime() function generates a formatted string from a
relationships in the data. For example you can fill the space related datasets.
datetime object, and the strptime() function generates a
between two sets of data. datetime object from a string. The following codes let you work with Sharing an x-axis
Plotting two sets of data dates exactly as you need to. The following code plots a set of squares and a set of cubes on
Here we use plt.scatter() twice to plot square numbers and %A Weekday name, such as Monday two separate graphs that share a common x-axis.
cubes on the same figure. The plt.subplots() function returns a figure object and a tuple
%B Month name, such as January of axes. Each set of axes corresponds to a separate plot in the
import matplotlib.pyplot as plt %m Month, as a number (01 to 12) figure. The first two arguments control the number of rows and
%d Day of the month, as a number (01 to 31) columns generated in the figure.
x_values = list(range(11)) %Y Four-digit year, such as 2016
%y Two-digit year, such as 16 import matplotlib.pyplot as plt
squares = [x**2 for x in x_values]
cubes = [x**3 for x in x_values] %H Hour, in 24-hour format (00 to 23)
%I Hour, in 12-hour format (01 to 12) x_vals = list(range(11))
%p AM or PM squares = [x**2 for x in x_vals]
plt.scatter(x_values, squares, c='blue',
%M Minutes (00 to 59) cubes = [x**3 for x in x_vals]
edgecolor='none', s=20)
plt.scatter(x_values, cubes, c='red', %S Seconds (00 to 61)
fig, axarr = plt.subplots(2, 1, sharex=True)
edgecolor='none', s=20)
Converting a string to a datetime object
axarr[0].scatter(x_vals, squares)
plt.axis([0, 11, 0, 1100]) new_years = dt.strptime('1/1/2017', '%m/%d/%Y')
axarr[0].set_title('Squares')
plt.show()
Converting a datetime object to a string
Filling the space between data sets axarr[1].scatter(x_vals, cubes, c='red')
The fill_between() method fills the space between two data ny_string = dt.strftime(new_years, '%B %d, %Y') axarr[1].set_title('Cubes')
sets. It takes a series of x-values and two series of y-values. It also print(ny_string)
takes a facecolor to use for the fill, and an optional alpha plt.show()
argument that controls the color’s transparency. Plotting high temperatures
The following code creates a list of dates and a corresponding list
Sharing a y-axis
plt.fill_between(x_values, cubes, squares, of high temperatures. It then plots the high temperatures, with the
To share a y-axis, we use the sharey=True argument.
facecolor='blue', alpha=0.25) date labels displayed in a specific format.
from datetime import datetime as dt import matplotlib.pyplot as plt
outcomes = [1, 2, 3, 4, 5, 6]
To make a plot with Pygal, you specify the kind of plot and frequencies = [18, 16, 18, 17, 18, 13]
then add the data.
chart = pygal.Bar()
Making a line graph
To view the output, open the file squares.svg in a browser.
chart.force_uri_protocol = 'http'
chart.x_labels = outcomes
import pygal chart.add('D6', frequencies)
chart.render_to_file('rolling_dice.svg')
x_values = [0, 1, 2, 3, 4, 5]
squares = [0, 1, 4, 9, 16, 25] Making a bar graph from a dictionary
Since each bar needs a label and a value, a dictionary is a great The documentation for Pygal is available at
way to store the data for a bar graph. The keys are used as the http://www.pygal.org/.
chart = pygal.Line() labels along the x-axis, and the values are used to determine the
chart.force_uri_protocol = 'http' height of each bar.
chart.add('x^2', squares)
chart.render_to_file('squares.svg') import pygal If you’re viewing svg output in a browser, Pygal needs to
render the output file in a specific way. The
Adding labels and a title results = { force_uri_protocol attribute for chart objects needs to
--snip-- 1:18, 2:16, 3:18, be set to 'http'.
chart = pygal.Line() 4:17, 5:18, 6:13,
chart.force_uri_protocol = 'http' }
chart.title = "Squares"
chart.x_labels = x_values chart = pygal.Bar()
chart.x_title = "Value" chart.force_uri_protocol = 'http'
chart.y_title = "Square of Value" chart.x_labels = results.keys()
chart.add('x^2', squares) chart.add('D6', results.values())
chart.render_to_file('squares.svg') chart.render_to_file('rolling_dice.svg')
Pygal lets you customize many elements of a plot. There Pygal can generate world maps, and you can add any data
are some excellent default themes, and many options for Configuration settings you want to these maps. Data is indicated by coloring, by
Some settings are controlled by a Config object.
styling individual plot elements. labels, and by tooltips that show data when users hover
my_config = pygal.Config() over each country on the map.
Using built-in styles
my_config.show_y_guides = False
To use built-in styles, import the style and make an instance of the Installing the world map module
style class. Then pass the style object with the style argument my_config.width = 1000 The world map module is not included by default in Pygal 2.0. It
when you make the chart object. my_config.dots_size = 5 can be installed with pip:
import pygal chart = pygal.Line(config=my_config) $ pip install --user pygal_maps_world
from pygal.style import LightGreenStyle --snip-- Making a world map
x_values = list(range(11)) Styling series The following code makes a simple world map showing the
You can give each series on a chart different style settings. countries of North America.
squares = [x**2 for x in x_values]
cubes = [x**3 for x in x_values] chart.add('Squares', squares, dots_size=2) from pygal.maps.world import World
chart.add('Cubes', cubes, dots_size=3)
chart_style = LightGreenStyle() wm = World()
chart = pygal.Line(style=chart_style) Styling individual data points wm.force_uri_protocol = 'http'
chart.force_uri_protocol = 'http' You can style individual data points as well. To do so, write a wm.title = 'North America'
chart.title = "Squares and Cubes" dictionary for each data point you want to customize. A 'value' wm.add('North America', ['ca', 'mx', 'us'])
chart.x_labels = x_values key is required, and other properties are optional.
import pygal wm.render_to_file('north_america.svg')
chart.add('Squares', squares)
chart.add('Cubes', cubes) Showing all the country codes
repos = [ In order to make maps, you need to know Pygal’s country codes.
chart.render_to_file('squares_cubes.svg') { The following example will print an alphabetical list of each country
'value': 20506, and its code.
Parametric built-in styles
Some built-in styles accept a custom color, then generate a theme 'color': '#3333CC',
'xlink': 'http://djangoproject.com/', from pygal.maps.world import COUNTRIES
based on that color.
},
from pygal.style import LightenStyle 20054, for code in sorted(COUNTRIES.keys()):
12607, print(code, COUNTRIES[code])
--snip-- 11827,
chart_style = LightenStyle('#336688')
Plotting numerical data on a world map
] To plot numerical data on a map, pass a dictionary to add()
chart = pygal.Line(style=chart_style) instead of a list.
--snip-- chart = pygal.Bar()
chart.force_uri_protocol = 'http' from pygal.maps.world import World
Customizing individual style properties
Style objects have a number of properties you can set individually. chart.x_labels = [
populations = {
'django', 'requests', 'scikit-learn',
chart_style = LightenStyle('#336688') 'ca': 34126000,
'tornado',
chart_style.plot_background = '#CCCCCC' ] 'us': 309349000,
chart_style.major_label_font_size = 20 chart.y_title = 'Stars' 'mx': 113423000,
chart_style.label_font_size = 16 chart.add('Python Repos', repos) }
--snip-- chart.render_to_file('python_repos.svg')
wm = World()
Custom style class wm.force_uri_protocol = 'http'
You can start with a bare style class, and then set only the wm.title = 'Population of North America'
properties you care about. wm.add('North America', populations)
chart_style = Style()
chart_style.colors = [ wm.render_to_file('na_populations.svg')
'#CCCCCC', '#AAAAAA', '#888888']
chart_style.plot_background = '#EEEEEE'
Create a temporary view
CREATE TEMPORARY VIEW v
AS
SELECT c1, c2
FROM t;

MIN returns the minimum value in a list

Create a trigger invoked before a new row is inserted into the person table
CREATE TRIGGER before_insert_person
BEFORE INSERT
ON person FOR EACH ROW
EXECUTE stored_procedure;
             Regression            Classifier
Outcome      Continuous            Class
Examples     Linear regression     Logistic regression, SVM, Naive Bayes

r Gradient descent – By noting α ∈ R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:
θ ← θ − α∇J(θ)
Remark: Stochastic gradient descent (SGD) updates the parameter based on each training example, while batch gradient descent does so on a batch of training examples.

r Likelihood – The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through likelihood maximization. In practice, we use the log-likelihood ℓ(θ) = log(L(θ)), which is easier to optimize. We have:
θopt = arg max over θ of L(θ)

r Type of model – The different models are summed up in the table below:

             Discriminative model     Generative model
Examples     Regressions, SVMs        GDA, Naive Bayes
Notations and general concepts

r Hypothesis – The hypothesis is noted hθ and is the model that we choose. For a given input data x(i), the model prediction output is hθ(x(i)).

r Loss function – A loss function is a function L : (z,y) ∈ R × Y ↦ L(z,y) ∈ R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:

r Newton's algorithm – Newton's algorithm is a numerical method that finds θ such that ℓ′(θ) = 0. Its update rule is as follows:
θ ← θ − ℓ′(θ) / ℓ′′(θ)
Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:
θ ← θ − (∇²θ ℓ(θ))⁻¹ ∇θ ℓ(θ)
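A small sketch of the scalar update rule above, applied to a toy concave function whose maximum we already know; the function is invented for illustration and is not from the source:

# Newton's update theta <- theta - l'(theta)/l''(theta), on l(theta) = -(theta - 3)**2
def l_prime(theta):
    return -2 * (theta - 3)

def l_double_prime(theta):
    return -2.0

theta = 0.0
for _ in range(10):
    theta = theta - l_prime(theta) / l_double_prime(theta)
print(theta)   # converges to 3, where l'(theta) = 0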
Linear regression
We assume here that y|x; θ ∼ N(µ, σ²).

• Normal equations – By noting X the design matrix, the value of θ that minimizes the cost
function is given by the closed-form solution:
    θ = (XᵀX)⁻¹ Xᵀ y

• LMS algorithm – By noting α the learning rate, the update rule of the Least Mean Squares
(LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff
learning rule, is as follows:
    ∀j,  θj ← θj + α ∑_{i=1}^{m} ( y(i) − hθ(x(i)) ) xj(i)
Remark: the update rule is a particular case of gradient ascent.

• LWR – Locally Weighted Regression, also known as LWR, is a variant of linear regression that
weights each training example in its cost function by w(i)(x), which is defined with parameter
τ ∈ R as:
    w(i)(x) = exp( −(x(i) − x)² / (2τ²) )
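A possible NumPy sketch of LWR at a single query point, using a weighted least-squares solve
as the fitting step; the helper name lwr_predict and the default τ are illustrative assumptions,
not taken from the cheatsheet.

import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Predict at x_query with Locally Weighted Regression.

    Each training example x(i) receives weight w(i) = exp(-||x(i) - x_query||^2 / (2 tau^2)),
    and theta is obtained from the weighted least-squares solve at the query point.
    """
    diffs = X - x_query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta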
Classification and logistic regression
• Sigmoid function – The sigmoid function g, also known as the logistic function, is defined
as follows:
    ∀z ∈ R,  g(z) = 1 / (1 + e^(−z))  ∈ ]0,1[

• Logistic regression – We assume here that y|x; θ ∼ Bernoulli(φ). We have the following
form:
    φ = p(y = 1 | x; θ) = 1 / (1 + exp(−θᵀx)) = g(θᵀx)

• Softmax regression – A softmax regression, also called multiclass logistic regression,
generalizes logistic regression to K > 2 outcome classes. The class probabilities are:
    φi = exp(θiᵀx) / ∑_{j=1}^{K} exp(θjᵀx)
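A minimal sketch of the two formulas above in NumPy; the shift by the maximum score in
softmax is a standard numerical-stability detail, not something stated in the cheatsheet.

import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), with values in ]0,1[."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Class probabilities phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x),
    given the vector of K scores theta_j^T x."""
    scores = scores - np.max(scores)   # shift for numerical stability; does not change the result
    exps = np.exp(scores)
    return exps / exps.sum()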
Generalized Linear Models
• Exponential family – A class of distributions is said to be in the exponential family if it can
be written in terms of a natural parameter η, also called the canonical parameter or link
function, a sufficient statistic T(y) and a log-partition function a(η) as follows:
    p(y; η) = b(y) exp( η T(y) − a(η) )
Remark: we will often have T(y) = y. Also, exp(−a(η)) can be seen as a normalization parameter
that makes sure the probabilities sum to one.

Here are the most common exponential distributions summed up in the following table:

Distribution    η                 T(y)    a(η)                   b(y)
Bernoulli       log(φ/(1−φ))      y       log(1 + exp(η))        1
Gaussian        µ                 y       η²/2                   (1/√(2π)) exp(−y²/2)
Poisson         log(λ)            y       e^η                    1/y!
Geometric       log(1−φ)          y       log( e^η/(1−e^η) )     1

• Assumptions of GLMs – Generalized Linear Models (GLM) aim at predicting a random
variable y as a function of x ∈ R^(n+1) and rely on the following 3 assumptions:
    (1) y|x; θ ∼ ExpFamily(η)    (2) hθ(x) = E[y|x; θ]    (3) η = θᵀx
Remark: ordinary least squares and logistic regression are special cases of generalized linear
models.
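As a quick check on the first row of the table, the Bernoulli pmf can be rewritten in the
exponential-family form above:
    p(y; φ) = φ^y (1 − φ)^(1−y) = exp( y log(φ/(1−φ)) + log(1 − φ) )
so η = log(φ/(1−φ)), T(y) = y, a(η) = −log(1 − φ) = log(1 + e^η) and b(y) = 1.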
Support Vector Machines
• Optimal margin classifier – The optimal margin classifier h is such that h(x) = sign(wᵀx − b),
where (w, b) ∈ Rⁿ × R is the solution of the following optimization problem:
    min (1/2)||w||²   such that   y(i)(wᵀx(i) − b) ≥ 1
Remark: the line is defined as wᵀx − b = 0.

• Hinge loss – The hinge loss is used in the setting of SVMs and is defined as follows:
    L(z,y) = [1 − yz]₊ = max(0, 1 − yz)

• Kernel – Given a feature mapping φ, we define the kernel K as:
    K(x,z) = φ(x)ᵀφ(z)
In practice, the kernel K defined by K(x,z) = exp( −||x−z||² / (2σ²) ) is called the Gaussian
kernel and is commonly used.
Remark: we say that we use the "kernel trick" to compute the cost function using the kernel
because we actually don't need to know the explicit mapping φ, which is often very complicated.
Instead, only the values K(x,z) are needed.

• Lagrangian – We define the Lagrangian L(w,b) as follows:
    L(w,b) = f(w) + ∑_{i=1}^{l} βi hi(w)
Remark: the coefficients βi are called the Lagrange multipliers.
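A short NumPy sketch of the hinge loss and the Gaussian kernel defined above; the function
names and the default σ are illustrative assumptions.

import numpy as np

def hinge_loss(z, y):
    """L(z, y) = max(0, 1 - y*z), for labels y in {-1, +1} and score z = w^T x - b."""
    return np.maximum(0.0, 1.0 - y * z)

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))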
Generative Learning
A generative model first tries to learn how the data is generated by estimating P(x|y), which
we can then use to estimate P(y|x) by using Bayes' rule.

Gaussian Discriminant Analysis
• Setting – Gaussian Discriminant Analysis (GDA) assumes that:
    y ∼ Bernoulli(φ),   x|y = 0 ∼ N(µ0, Σ),   x|y = 1 ∼ N(µ1, Σ)

• Estimation – The following table sums up the estimates that we find when maximizing the
likelihood:
    φ̂ = (1/m) ∑_{i=1}^{m} 1{y(i) = 1}
    µ̂j = ( ∑_{i=1}^{m} 1{y(i) = j} x(i) ) / ( ∑_{i=1}^{m} 1{y(i) = j} )    (j = 0,1)
    Σ̂ = (1/m) ∑_{i=1}^{m} (x(i) − µy(i))(x(i) − µy(i))ᵀ

Naive Bayes
• Assumption – The Naive Bayes model supposes that the features of each data point are all
independent:
    P(x|y) = P(x1, x2, ...|y) = P(x1|y) P(x2|y) ... = ∏_{i=1}^{n} P(xi|y)

• Solutions – Maximizing the log-likelihood gives the following solutions, with k ∈ {0,1},
l ∈ [[1,L]]:
    P(y = k) = (1/m) × #{j | y(j) = k}
    P(xi = l | y = k) = #{j | y(j) = k and xi(j) = l} / #{j | y(j) = k}
Remark: Naive Bayes is widely used for text classification and spam detection.
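A count-based sketch of the Naive Bayes estimates above for discrete features, assuming binary
labels and no Laplace smoothing; the function name and data layout are illustrative assumptions.

import numpy as np
from collections import Counter, defaultdict

def naive_bayes_fit(X, y):
    """Count-based estimates P(y = k) and P(x_i = l | y = k) for discrete features.

    X is an (m, n) array of categorical feature values, y an (m,) array of labels in {0, 1}.
    """
    m, n = X.shape
    priors = {k: np.mean(y == k) for k in (0, 1)}          # P(y = k) = #{j | y(j) = k} / m
    cond = defaultdict(dict)
    for k in (0, 1):
        Xk = X[y == k]
        for i in range(n):
            counts = Counter(Xk[:, i].tolist())            # #{j | y(j) = k and x_i(j) = l}
            cond[(i, k)] = {l: c / len(Xk) for l, c in counts.items()}
    return priors, cond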
Tree-based and ensemble methods
These methods can be used for both regression and classification problems.

• CART – Classification and Regression Trees (CART), commonly known as decision trees,
can be represented as binary trees. They have the advantage of being very interpretable.

• Random forest – It is a tree-based technique that uses a high number of decision trees
built out of randomly selected sets of features. Contrary to a single decision tree, it is highly
uninterpretable, but its generally good performance makes it a popular algorithm.
Remark: random forests are a type of ensemble method.

• Boosting – The idea of boosting methods is to combine several weak learners to form a
stronger one; the main variants are adaptive boosting and gradient boosting.

Other non-parametric approaches
• k-nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a
non-parametric approach where the response of a data point is determined by the nature of its
k neighbors from the training set. It can be used in both classification and regression settings.
Remark: the higher the parameter k, the higher the bias; the lower the parameter k, the
higher the variance.
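A minimal k-NN classification sketch using Euclidean distance and majority vote; the function
name and the default k are illustrative assumptions.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=5):
    """Majority vote among the k nearest training neighbours of x_query (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]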
Learning theory
The following assumptions are made:
• the training and testing sets follow the same distribution
• the training examples are drawn independently

• Shattering – Given a set S = {x(1), ..., x(d)} and a set of classifiers H, we say that H shatters
S if for any set of labels {y(1), ..., y(d)}, there exists h ∈ H such that h(x(i)) = y(i) for all
i ∈ [[1,d]].

• Upper bound theorem – Let H be a finite hypothesis class such that |H| = k and let δ and
the sample size m be fixed. Then, with probability of at least 1 − δ, we have:
    ε(ĥ) ≤ ( min_{h∈H} ε(h) ) + 2 √( (1/(2m)) log(2k/δ) )
where ε(h) denotes the generalization error of h and ĥ the hypothesis returned by empirical
risk minimization.

• Hoeffding inequality – Let Z1, ..., Zm be m iid variables drawn from a Bernoulli distribution
of parameter φ. Let φ̂ be their sample mean and γ > 0 fixed. We have:
    P(|φ − φ̂| > γ) ≤ 2 exp(−2γ²m)
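A small simulation sketch that checks the Hoeffding bound empirically; the values φ = 0.3,
m = 200 and γ = 0.05 are illustrative assumptions, not taken from the cheatsheet.

import numpy as np

# Empirical check of P(|phi - phi_hat| > gamma) <= 2 exp(-2 gamma^2 m)
rng = np.random.default_rng(0)
phi, m, gamma, trials = 0.3, 200, 0.05, 10_000
phi_hat = rng.binomial(1, phi, size=(trials, m)).mean(axis=1)   # sample means over m draws
frequency = np.mean(np.abs(phi - phi_hat) > gamma)              # empirical deviation frequency
bound = 2 * np.exp(-2 * gamma ** 2 * m)                         # Hoeffding upper bound
print(f"empirical frequency: {frequency:.3f}   Hoeffding bound: {bound:.3f}")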