
Introduction to Data Science and Machine Learning


What is Data Science?

A multidisciplinary field of research that uses scientific methods,
processes, and algorithms to extract information from data.

Internal Use - Confidential


What do you need to know in DS?

Math
• Algebra (data is represented as a matrix)
• Calculus (for optimization algorithms)

Statistics
• Moments, distributions, correlations
• Hypothesis testing

Computer Science
• Programming languages (R/Python), SQL
• Programming algorithms

Domain expertise
• Good understanding of the problem
• Discuss with experienced people



What types of data do you know?



Unstructured data



Structured data



What is Machine Learning?

Machine learning (ML) is the study of computer algorithms that improve
automatically through experience and by the use of data.



Types of learning



ML Algorithms
Machine Learning
  Supervised (Regression, Classification)
    - Linear Regression
    - Logistic Regression
    - Decision Trees
    - SVM
    - Random Forest
    - Neural Networks
    - Discriminant Analysis
    - KNN
  Unsupervised
    Clustering
      - Hierarchical
      - K-Means
    Dimension Reduction
      - PCA
      - SVD
      - Embeddings



Problems you can solve with ML



Neural networks (NN)

• Electrical charge comes in through the dendrites.
• Once a certain electrical potential is reached, the electrical signal
propagates through the axon.
• Axon terminals are connected to the dendrites of other neurons.
• The process goes on.



NN mathematical analogy

• $x_1, x_2, \dots, x_n$ are the input features (columns, variables).
• $w_1, w_2, \dots, w_n$ are the weights for each feature.
• $z = \sum_i w_i x_i$ is the linear combination of inputs and weights.
• $\sigma$ is the activation function.
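
Written out as a formula, the analogy can be sketched as follows. The names $x_i$, $w_i$, $z$ and $\hat{y}$ are notation assumed here, and the sigmoid is shown as one common choice of activation (it is the one used in the forward propagation example), not the only possible one.

```latex
z = \sum_{i=1}^{n} w_i x_i, \qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
```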



NN – forward propagation

• The model outputs probabilities, and we need to decide whether the
‘signal’ will pass forward.
• A threshold is used to decide whether an application is fraud.
• If the output is larger than 0.5, the application is classified as
fraud.

Inputs:

            X1        X2       X3        X4
            Num_Empl  Capital  Financed  Assets
Company 1   2         0        3         0
Company 2   5         2        1         2
Company 3   1         1        2         2

Weights: W = (1, -0.9, 0.4, -1.2)

Inputs * W = Sumator -> Sigmoid -> Y_hat -> compare with Y

            Sumator  Sigmoid   Y_hat (Threshold=0.5)  Y (Real fraud)
Company 1   3.2      0.039166  0                      0
Company 2   1.2      0.231475  0                      1
Company 3   -1.5     0.817574  1                      0
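
The table can be reproduced with a few lines of Python. This is a minimal sketch rather than the presenter's code: the function names are made up for illustration, and the inputs and weights are copied from the table. One detail worth noting is that the slide's Sigmoid column matches 1 / (1 + e^z), which is the standard sigmoid evaluated at -z. The sketch keeps that behaviour in `slide_activation` so the numbers agree with the table, and exposes the standard `sigmoid` separately.

```python
import math

# Inputs (Num_Empl, Capital, Financed, Assets) and weights copied from the table.
X = {
    "Company 1": [2, 0, 3, 0],
    "Company 2": [5, 2, 1, 2],
    "Company 3": [1, 1, 2, 2],
}
W = [1.0, -0.9, 0.4, -1.2]
THRESHOLD = 0.5


def weighted_sum(x, w):
    """Linear combination of inputs and weights (the 'Sumator' column)."""
    return sum(xi * wi for xi, wi in zip(x, w))


def sigmoid(z):
    """Standard logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def slide_activation(z):
    """Reproduces the table's 'Sigmoid' column, which equals sigmoid(-z)."""
    return sigmoid(-z)


def forward(x, w, activation=slide_activation, threshold=THRESHOLD):
    """Return (sumator, activation output, thresholded prediction)."""
    z = weighted_sum(x, w)
    p = activation(z)
    y_hat = 1 if p > threshold else 0
    return z, p, y_hat


results = {name: forward(x, W) for name, x in X.items()}
for name, (z, p, y_hat) in results.items():
    print(f"{name}: sumator={z:.1f} sigmoid={p:.6f} y_hat={y_hat}")
```

Passing `activation=sigmoid` to `forward` shows what the standard sigmoid would give for the same weighted sums.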



NN - loss

Loss function:
• Depends on the weights vector.
• It is convex only for simple models; for networks with hidden layers it
is generally not convex on the full domain, only on regions around a
minimum.
• Generates the optimization problem: minimize the loss with respect to
the weights.
• Finding the global minimum means finding the model that best fits the
training data.

$$J(w) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2$$

Y_hat (Threshold=0.5)  Y (Real fraud)
0                      0
0                      1
1                      0

Loss = 0.666667
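
As a quick check of the loss value, the sketch below computes a mean squared error over the three thresholded predictions. Treating the loss as a squared error is an assumption (with 0/1 values an absolute error gives the same number), and `mean_squared_error` is a name chosen here for illustration.

```python
def mean_squared_error(y_hat, y):
    """Average of squared differences between predictions and labels."""
    if len(y_hat) != len(y):
        raise ValueError("y_hat and y must have the same length")
    m = len(y)
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / m


# Thresholded predictions and real labels from the table.
y_hat = [0, 0, 1]
y = [0, 1, 0]

loss = mean_squared_error(y_hat, y)
print(f"{loss:.6f}")
```

Two of the three predictions are wrong, so the loss is 2/3, which matches the 0.666667 shown in the table.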
NN – forward and backward



NN – what to optimize?
Hyper Parameters
- Neurons
- Hidden layers
- Learning Rate
- Loss Function
- Optimizer
- Metric
- Dropout
- Early stopping

Parameters
- Weights



NN - intuition

Neural networks can learn to approximate any function (with a certain cost).

The essence of supervised learning:


- Get training examples
- Find a function that approximates the real function
- Minimize the loss
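
The three steps can be illustrated with a deliberately tiny example: a single weight and bias fitted by gradient descent. Everything here is an assumption for illustration (the "real function" y = 2x + 1, the learning rate, the number of epochs, and the name `train_linear`). A neural network follows the same loop with many more weights.

```python
def train_linear(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        # Prediction error for every training example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = (2.0 / m) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / m) * sum(errors)
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Step 1: training examples drawn from a hypothetical real function y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

# Steps 2 and 3: find an approximating function by minimizing the loss.
w, b = train_linear(xs, ys)
print(f"w={w:.4f} b={b:.4f}")
```

The learned `w` and `b` end up close to 2 and 1, recovering the function that generated the examples.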

Why do we not know the real function?


- Because we do not have all the input data for a given phenomenon.
- Because we are unable to grasp the full complexity of a phenomenon.



NN - intuition

Using hidden layers, a neural network can represent objects in a space
with more dimensions than the original space (at a certain cost).
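
A classic illustration of this idea is XOR. The four points cannot be separated by a single line in the original two dimensions, but after a hidden layer maps them into three dimensions, one linear output is enough. The weights below are hand-picked for illustration, not learned, and the ReLU activation is an assumption.

```python
def relu(v):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, v)


def hidden(x1, x2):
    """Map a 2-D input to a 3-D hidden representation (hand-picked weights)."""
    return (relu(x1), relu(x2), relu(x1 + x2 - 1.0))


def output(h):
    """A single linear unit applied to the 3-D hidden representation."""
    h1, h2, h3 = h
    return h1 + h2 - 2.0 * h3


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
preds = {p: output(hidden(*p)) for p in points}
for p, value in preds.items():
    print(p, hidden(*p), value)
```

The output is 1 exactly when the two inputs differ, which is the XOR function. The "certain cost" is the extra weights that the hidden layer introduces.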



NN – bias and variance



Fraud Detection Project

Data sources: UDB, UDE, Databases, XML

Training data
- 29683 applications
- 36 variables
- 268 frauds

Fraud Tracker
- Model with 1 hidden layer
- 2433 parameters
- AUC 0.9236
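
The parameter count can be sanity-checked with a short calculation. The slide gives 36 variables, 1 hidden layer and 2433 parameters, but not the layer width. The sketch below assumes a fully connected network with 64 hidden neurons and a single output neuron, which is a guess that happens to reproduce 2433; `dense_param_count` is a name chosen for illustration.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


n_inputs = 36   # variables, from the slide
n_hidden = 64   # assumption: not stated on the slide
n_outputs = 1   # assumption: one fraud probability

total = dense_param_count([n_inputs, n_hidden, n_outputs])
print(total)  # (36 * 64 + 64) + (64 * 1 + 1)

# Share of frauds in the training data, from the slide's counts.
prevalence = 268 / 29683
print(f"{prevalence:.2%}")
```

Under these assumptions the count is 2433, and the fraud share comes out at about 0.9% of the applications.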



Challenges
1. Use the same information that people use when deciding whether an application is fraud.

2. Select the most suitable variables for the model (tables with hundreds of columns).

3. Internal databases did not provide quality data, so data parsed from XML files was used instead.

4. Clean / transform data (quite messy and undocumented).

5. Highly imbalanced data: fraud prevalence in the training data is 0.9%.

6. Find a model able to generalize and perform well in real conditions.
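
To give a sense of what the imbalance in challenge 5 means in practice, the sketch below computes per-class weights that make both classes contribute equally to a weighted loss. This is one common remedy, shown as an illustration only; the slides do not say how the project handled the imbalance. The formula n_samples / (n_classes * class_count) is the heuristic scikit-learn calls "balanced", and the counts come from the training data figures (29683 applications, 268 frauds).

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Class counts derived from the training data figures.
counts = {"fraud": 268, "not_fraud": 29683 - 268}

weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: weight={weight:.4f}")
```

With these weights, a misclassified fraud costs roughly a hundred times more than a misclassified non-fraud, which offsets how rare frauds are.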

