Cédrick Lunven (clunven, clun)
➢ Trainer
➢ Public Speaker
➢ Developers Support
➢ Developer Applications
➢ Developer Tooling
Stefano Lottini, Developer Advocate (@stefano-lottini, @hemidactylus, @rsprrs)
➢ Developer/Architect
➢ Apache Cassandra™ certified
dtsx.io/stefano
Live Sessions
● YouTube (with nightbot)
● Twitch
● Discord (#workshop-chat)
⌨ !discord
⌨ !menti
Hands-On Housekeeping
Nothing to install!
● Source code + exercises + slides: ⌨ !github
● IDE: ⌨ !gitpod
⌨ !astra
Achievement Unlocked!
572 and counting
Agenda
01 Machine Learning (Definitions, Process, Metrics)
02 Setup your env (Database, IDE, Tools)
03 Learn about your tools (Cassandra, Jupyter, Spark)
04 Algos 1 (Clustering, Inference)
05 Algos 2 (Classification, Recommendation)
06 What’s next? (Quiz, Homework, Next week)
Machine Learning
Machine learning is the scientific study of algorithms and statistical
models that computer systems use to perform a specific task without
using explicit instructions, relying on patterns and inference instead. It is
seen as a subset of artificial intelligence. Machine learning algorithms
build a mathematical model based on sample data, known as "training
data", in order to make predictions or decisions without being explicitly
programmed to perform the task. Machine learning algorithms are used
in a wide variety of applications, such as email filtering and computer
vision, where it is difficult or infeasible to develop a conventional
algorithm for effectively performing the task.
Wikipedia.org
Decisions
How it works?
● Knowledge of the output: learn with an expert
○ Data are labelled with a class or value
○ Goal: predict the class or value
Supervised Learning
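As a minimal sketch of learning from labelled data, the snippet below classifies a new point by copying the label of its closest labelled example (1-nearest neighbour, one of the algorithms listed in this deck). The training points, the labels "small" and "large", and the function name `predict` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical labelled training data: (feature vector, class label).
training = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]

def predict(point):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training, key=lambda example: math.dist(example[0], point))
    return nearest[1]

print(predict((152.0, 53.0)))
print(predict((182.0, 88.0)))
```

The labels play the role of the "expert": without them, the same points could only be grouped, not named.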
● No knowledge of the output: self-guided
○ Data are not labelled with a class or value
○ Goal: determine patterns or groupings
Unsupervised Learning
● FP-Growth
● K-Means Clustering
● Naive Bayes
● Decision Trees
● Neural Networks
● Collaborative Filtering
● Logistic Regression
● Support Vector Machines
● Linear Regression
● Apriori Algorithm
● Case-based Reasoning
● Dimensionality Reduction Algorithms
● Gradient Boosting Algorithms
● Hidden Markov Models
● Self-organizing Map
● K-Nearest Neighbour
● ECLAT
Algorithms
● Supervised Learning: pre-categorized data, predictive models
○ Classification: Naive Bayes
○ Regressions: Random Forest
● Semi-Supervised
● Unsupervised Learning: unlabelled data, pattern recognition
○ Clustering: K-means
○ Association: FP-Growth
○ Dimensionality Reduction: Collaborative Filtering
● Reinforcement Learning
Accuracy
Accuracy counts correct predictions out of all predictions.
Precision
Precision counts true positives out of all true and false positives.
Recall
Recall counts correctly identified positives out of all real positives.
Anub Bhande @ Medium.com
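The three metrics can be computed directly from the four counts of a binary confusion matrix. This is a small sketch; the function name `classification_metrics` and the counts passed to it are illustrative assumptions, not results from the workshop.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Illustrative counts (assumption): a rare positive class.
accuracy, precision, recall = classification_metrics(tp=1, fp=1, fn=8, tn=90)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With these counts the accuracy is high while the recall is low, which is why accuracy alone can mislead when one class is rare.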
APPLICATION LAYER: Jupyter Notebook, CLI, DSBulk
DATA LAYER: Apache Cassandra
Apache Cassandra™
Data is Distributed
Partition key: Country

Country   City          Population
USA       New York      8.000.000
USA       Los Angeles   4.000.000
FR        Paris         2.230.000
DE        Berlin        3.350.000
FR        Toulouse      1.100.000
DE        Nuremberg     500.000
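A rough sketch of how a partition key decides where a row lives: hash the key, then map the hash to a node. This is a simplification. Cassandra's default partitioner uses Murmur3 and token ranges on a ring rather than MD5 and a modulo, and the node names and the function `node_for` below are hypothetical.

```python
import hashlib
from collections import defaultdict

NODES = ["node-1", "node-2", "node-3"]  # hypothetical three-node cluster

rows = [
    ("USA", "New York", 8_000_000),
    ("USA", "Los Angeles", 4_000_000),
    ("FR", "Paris", 2_230_000),
    ("DE", "Berlin", 3_350_000),
    ("FR", "Toulouse", 1_100_000),
    ("DE", "Nuremberg", 500_000),
]

def node_for(partition_key: str) -> str:
    """Hash the partition key and map the resulting token to a node."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]

# Group rows by the node that owns their partition key (the country).
placement = defaultdict(list)
for country, city, population in rows:
    placement[node_for(country)].append((country, city, population))

for node, stored in sorted(placement.items()):
    print(node, stored)
```

Rows sharing a country always land on the same node, which is what makes reading a whole partition cheap.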
And many others...
dtsx.io/cassandra-at-netflix
Astra DB
Astra Streaming
Gitpod
Docker Engine is a client-server application with:
Docker
Jupyter Notebook
Python is an interpreted, high-level, general-purpose programming
language. Created by Guido van Rossum and first released in 1991, Python's
design philosophy emphasizes code readability with its notable use of
significant whitespace.
Python
Pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with “relational” or “labeled” data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, real world data analysis in Python.
Pandas
Apache Spark is written in the Scala programming language. PySpark is the
Python API for Spark, released to let Python and Apache Spark work together.
PySpark also lets you work with Resilient Distributed Datasets (RDDs) from
Python.
PySpark
NumPy is the fundamental package for scientific computing
with Python.
NumPy
An open source, simple and efficient tool for predictive data analysis, accessible
to everybody, and reusable in various contexts. Built on NumPy, SciPy, and
matplotlib.
scikit-learn
● Question / Hypothesis
● Algorithm Selection
● Data Preparation
● Data Split
● Training
● Tuning
● Testing
● Analysis
● Repeat
Learning Workflow
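[Diagram: machine learning workflow]
● Preparation: Raw Data → Data with features and labels
● Split: Training Set, Validating Set, Testing Set
● Training: Train on the Training Set, Tune on the Validating Set, Estimate on the Testing Set → Machine Learning Model → Final Model
● Prediction: New Data → Final Model → Predicted Labels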
Data Preparation
Data Split
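A minimal sketch of the split step: shuffle the prepared rows, then cut them into a training set, a validating set and a testing set. The 60/20/20 proportions, the seed, and the function name `split_dataset` are assumptions made for illustration; the slides do not prescribe them.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and cut them into training, validating and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the testing set
    )

rows = list(range(100))  # stand-in for rows of features and labels
training_set, validating_set, testing_set = split_dataset(rows)
print(len(training_set), len(validating_set), len(testing_set))
```

Keeping the testing set untouched until the end is what makes the final estimate trustworthy.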
Training
FUN FACT: this image was created by … an algorithm,
starting from the textual prompt: "a metallic cyborg in a gym"
https://huggingface.co/spaces/stabilityai/stable-diffusion
Tuning
Testing
Final Model
Prediction
Demo with K-means
http://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/
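For readers who cannot open the demo, here is a small pure-Python sketch of the k-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sample points, the starting centroids, and the fixed iteration count are assumptions for illustration; the lab itself may use a library implementation.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate the assignment step and the centroid update step."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid  # an empty cluster keeps its centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

No labels are involved: the two groups emerge only from the distances between points.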
Lab 2: K-Means
https://github.com/datastaxdevs/workshop-introduction-to-machine-learning#6-algorithms
Naïve Bayes
Supervised Classification Algorithm
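“Naïve Bayes classifiers are a family of simple "probabilistic
classifiers" based on applying Bayes' theorem with strong
"naïve" independence assumptions between the features.”
SPAM
● Money transfer received
● Get cash easy!
● Win money now!
NOT SPAM
● Greetings from Dad
● Could you lend me money?
● It was easy!
● I miss you
● Check your trip itinerary
Unnormalized scores for a message containing ‘Easy’ and ‘Money’:
\[ P(\text{spam} \mid \text{Easy}, \text{Money}) \propto P(\text{Easy}, \text{Money} \mid \text{spam}) \, P(\text{spam}) = P(\text{Easy} \mid \text{spam}) \, P(\text{Money} \mid \text{spam}) \, P(\text{spam}) = \tfrac{1}{3} \cdot \tfrac{2}{3} \cdot \tfrac{3}{8} = \tfrac{1}{12} \]
\[ P(\text{not} \mid \text{Easy}, \text{Money}) \propto P(\text{Easy}, \text{Money} \mid \text{not}) \, P(\text{not}) = P(\text{Easy} \mid \text{not}) \, P(\text{Money} \mid \text{not}) \, P(\text{not}) = \tfrac{1}{5} \cdot \tfrac{1}{5} \cdot \tfrac{5}{8} = \tfrac{1}{40} \]
The two posteriors must sum to 1:
\[ P(\text{spam} \mid \text{Easy}, \text{Money}) + P(\text{not} \mid \text{Easy}, \text{Money}) = 1 \]
Dividing the spam score by the sum of both scores gives the normalized posterior:
\[ P(\text{spam} \mid \text{Easy}, \text{Money}) = \frac{1/12}{1/12 + 1/40} = \frac{10}{13} \approx 0.77 \]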
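The arithmetic of the spam example can be checked with exact fractions. In this sketch a word "occurs" in a message if it appears as a case-insensitive substring, and no smoothing is applied; both are simplifying assumptions, and the helper names `word_likelihood` and `score` are made up for illustration.

```python
from fractions import Fraction

spam = ["Money transfer received", "Get cash easy!", "Win money now!"]
not_spam = ["Greetings from Dad", "Could you lend me money?", "It was easy!",
            "I miss you", "Check your trip itinerary"]

def word_likelihood(word, messages):
    """Fraction of the messages in one class that contain the word."""
    hits = sum(word.lower() in message.lower() for message in messages)
    return Fraction(hits, len(messages))

def score(words, messages, total):
    """Class prior times the product of per-word likelihoods (the naive assumption)."""
    result = Fraction(len(messages), total)
    for word in words:
        result *= word_likelihood(word, messages)
    return result

total = len(spam) + len(not_spam)
words = ["easy", "money"]
spam_score = score(words, spam, total)          # 1/12
not_spam_score = score(words, not_spam, total)  # 1/40
posterior_spam = spam_score / (spam_score + not_spam_score)
print(spam_score, not_spam_score, posterior_spam)  # 1/12 1/40 10/13
```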
Random Forest
Supervised Classification Ensemble Method
“A decision tree is a simple representation for classifying examples.
Assume that all of the input features have finite discrete
domains, and there is a single target feature called the
"classification". Each element of the domain of the
classification is called a class. A decision tree is a tree in which
each internal node is labeled with an input feature. The arcs
coming from a node labeled with an input feature are labeled
with each of the possible values of the target or output feature
or the arc leads to a subordinate decision node on a different
input feature. Each leaf of the tree is labeled with a class or a
probability distribution over the classes, signifying that the
data set has been classified by the tree into either a specific
class, or into a particular probability distribution.”
[Figure: Titanic Survival Decision Tree]
Under-fitted vs over-fitted
Random Forest Ensemble Method
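The ensemble idea can be sketched without a library: train many very small trees, each on a bootstrap sample and a randomly chosen feature, and let them vote. This is a simplified illustration that uses depth-1 trees (decision stumps), not a full random forest. The toy data, the labels "low" and "high", and all function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(samples, feature):
    """Find the threshold on one feature that misclassifies the fewest samples."""
    values = sorted({x[feature] for x, _ in samples})
    majority = Counter(label for _, label in samples).most_common(1)[0][0]
    best_errors, best_stump = len(samples) + 1, None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [label for x, label in samples if x[feature] <= threshold]
        right = [label for x, label in samples if x[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if errors < best_errors:
            best_errors = errors
            best_stump = (feature, threshold, left_label, right_label)
    # A sample with one distinct value cannot be split: always predict the majority.
    return best_stump or (feature, float("inf"), majority, majority)

def train_forest(samples, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]
        forest.append(train_stump(bootstrap, rng.randrange(n_features)))
    return forest

def predict(forest, x):
    """Majority vote over all stumps."""
    votes = Counter(left if x[feature] <= threshold else right
                    for feature, threshold, left, right in forest)
    return votes.most_common(1)[0][0]

samples = [((1.0, 1.0), "low"), ((2.0, 1.0), "low"), ((1.0, 2.0), "low"),
           ((2.0, 2.0), "low"), ((1.5, 1.5), "low"),
           ((8.0, 8.0), "high"), ((9.0, 8.0), "high"), ((8.0, 9.0), "high"),
           ((9.0, 9.0), "high"), ((8.5, 8.5), "high")]
forest = train_forest(samples)
print(predict(forest, (1.2, 1.8)), predict(forest, (8.7, 8.2)))
```

Averaging many weak, slightly different trees is what reduces the over-fitting a single deep tree is prone to.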
FP-Growth
Association-rule-based recommendation system
Association-rule learning
A class of models to infer relations between elements in
a list of sets. E.g. market-basket analysis:
"Customers who bought x, y, z also bought w"
user1: bread, tomatoes, mayonnaise
user2: ham, cheese, mayonnaise
user3: bread, ham, mayonnaise, pickles
user4: beer, diapers :)
user5: beer, ham, milk
…
Generally proceed in two steps:
1. identify items that are often together
2. recast this information as "association rules"
FP-Growth
A method to build the "frequent itemsets" in 2 passes
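The two steps can be sketched on the five baskets from the slide. Note that this sketch counts itemsets by brute force instead of building an FP-tree, so it illustrates the result FP-Growth produces, not the FP-Growth algorithm itself. The minimum support of 2 and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "tomatoes", "mayonnaise"},
    {"ham", "cheese", "mayonnaise"},
    {"bread", "ham", "mayonnaise", "pickles"},
    {"beer", "diapers"},
    {"beer", "ham", "milk"},
]

def frequent_itemsets(baskets, min_support=2, max_size=2):
    """Step 1: count every itemset up to max_size and keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(basket), size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

def confidence(frequent, antecedent, consequent):
    """Step 2: confidence of the rule antecedent -> consequent."""
    both = tuple(sorted(antecedent + consequent))
    return frequent[both] / frequent[antecedent]

frequent = frequent_itemsets(baskets)
print(frequent)
print(confidence(frequent, ("bread",), ("mayonnaise",)))
print(confidence(frequent, ("ham",), ("mayonnaise",)))
```

In this tiny data set every basket with bread also contains mayonnaise, so the rule bread → mayonnaise has confidence 1.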
Collaborative Filtering
Matrix-factorization-based recommendation system
Users "collaborate" in identifying your tastes
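Idea: "If you share the wine and snack tastes of a group …"
A is the m × n matrix of votes (m users, n items): $A_{ij}$ is the "vote" to item j by user i.
\[ A = U \, D \, V^{T}, \qquad U \in \mathbb{R}^{m \times m}, \; D \in \mathbb{R}^{m \times n}, \; V \in \mathbb{R}^{n \times n} \]
\[ D = \operatorname{diag}(\sigma_1, \sigma_2, \sigma_3, 0, \dots, 0) \]
3 nonzero singular values = rank 3 = low-dimensional reduction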
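To make the low-rank idea concrete, the sketch below extracts the largest singular value and its two vectors by power iteration, then rebuilds the matrix from that single term. It is a pure-Python illustration under assumptions: the vote matrix is hypothetical and has rank 1, so one term reproduces it exactly, whereas a rank-3 matrix would need three such terms. In practice a library routine such as `numpy.linalg.svd` would be used instead.

```python
import math

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def top_singular_triple(A, iterations=50):
    """Power iteration on A^T A: returns (sigma, u, v) for the largest singular value."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

# Hypothetical votes: 3 users x 2 items; every row is a multiple of (4, 5), so rank 1.
A = [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
sigma, u, v = top_singular_triple(A)

# Rank-1 reconstruction: sigma * u * v^T.
approx = [[sigma * ui * vj for vj in v] for ui in u]
print(sigma)
print(approx)
```

Each user is summarized by one number in u and each item by one number in v, which is the "low-dimensional reduction" the slide refers to.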
Homework
⌨ !homework
⌨ !discord
Discord Community: dtsx.io/discord
Thank You!