
Self Introduction

Latest Happenings in Data Science


By
Manikandan
Outline

ML Fairness
AutoML
Domain Experts in DS
Fairness of Machine Learning Model (FairML)
Mention of ML Fairness in Research Papers
Thoughts?

What if I told you computers can treat you unfairly?

Would you believe me?


Google Translation In Action
Commercial Gender Image Classification
An interesting paper, Gender Shades: Intersectional Accuracy Disparities in Commercial Gender
Classification, reveals bias in commercial algorithms.
Microsoft's Twitter-based AI chatbot Tay
XING, a job platform similar to LinkedIn
The list goes on...
Bias in ML is almost inevitable when the
application involves people.
It has already harmed people in minority
groups and historically disadvantaged groups.
If no one cares, it is highly likely that the next person
to suffer from biased treatment will be one of us.
Definition of Fairness

● Group Fairness
Partitions a population into groups defined by protected
attributes (such as gender, caste, or religion) and requires some
statistical measure to be equal across groups.

● Individual Fairness
Similar individuals should be treated similarly.
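
To make the group-fairness definition concrete, here is a minimal sketch of one possible "statistical measure": the rate of positive predictions per group (often called demographic parity). The function names and the toy loan data are illustrative assumptions, not taken from any library.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Return the share of positive (1) predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Toy data (hypothetical): 1 = loan approved, 0 = loan denied.
preds = [1, 1, 1, 0, 1, 0, 0, 0]
group = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = positive_rate_by_group(preds, group)
gap = demographic_parity_gap(preds, group)
print(rates, gap)
```

A gap of 0 would mean both groups receive positive predictions at the same rate; how large a gap is acceptable is a policy decision, not something the code can answer.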
ML Unfairness - Causes (Data)

● Skewed sample
● Tainted examples
● Limited features
● Sample size disparity
● Proxies
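
Proxies are the least obvious cause in this list: a feature that looks neutral can still encode a protected attribute. A rough first check, assuming numerically encoded columns, is to correlate each feature with the protected attribute. The column names, the toy data, and the 0.8 threshold below are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def find_proxies(features, protected, threshold=0.8):
    """Names of features whose absolute correlation with the protected attribute reaches the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, protected)) >= threshold
    ]


# Hypothetical data: the protected attribute is encoded as 0/1.
protected = [0, 0, 0, 1, 1, 1]
features = {
    "postcode_area": [1, 1, 2, 8, 9, 9],
    "years_experience": [2, 7, 4, 3, 6, 5],
}

proxies = find_proxies(features, protected)
print(proxies)
```

Linear correlation only catches simple proxies; a combination of several features can still reconstruct the protected attribute even when no single column is flagged.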
Difficulties in ensuring ML Algorithm is Fair
Interpretable Machine Learning
IML Benefits

● Fairness: Ensuring that predictions are unbiased and do not implicitly or explicitly
discriminate against protected groups. An interpretable model can tell you why it has
decided that a certain person should not get a loan, and it becomes easier for a human
to judge whether the decision is based on a learned demographic (e.g. racial) bias.
● Privacy: Ensuring that sensitive information in the data is protected.
● Reliability or Robustness: Ensuring that small changes in the input do not lead to
large changes in the prediction.
● Trust: It is easier for humans to trust a system that explains its decisions compared to a
black box.
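
The robustness point can be checked empirically. A minimal sketch, assuming the model is available as a plain function over a list of numeric features: nudge each feature by a small amount and record the largest change in the prediction. `toy_model` is a hypothetical stand-in for a trained model.

```python
def max_prediction_shift(predict, x, epsilon=0.01):
    """Largest absolute change in predict(x) when a single feature moves by +/- epsilon."""
    baseline = predict(x)
    worst = 0.0
    for i in range(len(x)):
        for delta in (-epsilon, epsilon):
            nudged = list(x)
            nudged[i] += delta
            worst = max(worst, abs(predict(nudged) - baseline))
    return worst


def toy_model(x):
    # Hypothetical linear scorer standing in for a trained model.
    return 0.5 * x[0] + 2.0 * x[1]


shift = max_prediction_shift(toy_model, [1.0, 3.0], epsilon=0.01)
print(shift)
```

A shift that is large relative to epsilon suggests the model is sensitive around that input and deserves a closer look.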
IML Architecture
Preferred Explaining - Model Interpretation
Way to go….
Explainability and Fairness - Just one `pip` away
● lime - https://github.com/marcotcr/lime
● shap - https://github.com/slundberg/shap
    import shap
    explainer = shap.TreeExplainer(model)   # model: a fitted tree-based model
    shap_values = explainer.shap_values(X)  # X: the feature matrix to explain
● eli5 - https://github.com/TeamHG-Memex/eli5
● scikit-lego - https://github.com/koaning/scikit-lego
    from sklego.preprocessing import InformationFilter
    from sklego.linear_model import FairClassifier
● What-if Tool - https://pair-code.github.io/what-if-tool/
● Captum - https://github.com/pytorch/captum
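
The shap lines in this list return Shapley values, which attribute a single prediction to the individual input features. For intuition, they can be computed exactly for a tiny model by averaging each feature's marginal contribution over every ordering of the features. This is a from-scratch sketch of the idea, not the algorithm the shap library uses; `toy_model` and the all-zero baseline are hypothetical.

```python
from itertools import permutations


def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one input, by enumerating all feature orderings.

    Features that have not been "revealed" yet are held at their baseline value.
    The cost grows factorially with the number of features, so this only suits toy cases.
    """
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            value = predict(current)
            contributions[i] += value - previous
            previous = value
    return [c / len(orderings) for c in contributions]


def toy_model(x):
    # Hypothetical model with an interaction between features 0 and 1.
    return 2.0 * x[0] + x[1] + x[0] * x[1]


values = exact_shapley(toy_model, x=[1.0, 2.0], baseline=[0.0, 0.0])
print(values)
```

The values always sum to the difference between the prediction for `x` and the prediction for the baseline, which is what makes them usable as an explanation of one decision.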
Is the organization holding the protected data
the only one responsible for ensuring digital
fairness?
Of course not.

Governments also need to bring in proper data
regulations to avoid the use of personal and
protected data.
Do we have any data regulation in the world?

Yes, GDPR in the European Union.

What changes did GDPR enforce?
GDPR in Action
Automation of Machine Learning (AutoML)
Team Data Science Process lifecycle

The Team Data Science Process (TDSP) is an agile, iterative data
science methodology to deliver predictive analytics solutions and
intelligent applications efficiently.
Roles & Responsibilities associated with Lifecycle
Automated Machine Learning (AutoML)

What Wikipedia says…

● AutoML is the process of automating the end-to-end process of
applying machine learning to real-world problems.

● In a typical machine learning application, practitioners have to carry out:

    ● Data pre-processing
    ● Feature engineering
    ● Feature extraction
    ● Feature selection
    ● Algorithm selection
    ● Hyperparameter optimization
    ● Validation

● As many of these steps are often beyond the abilities of non-experts, AutoML
was proposed as an artificial intelligence-based solution to the ever-growing
challenge of applying machine learning.
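
The last three steps in the list (algorithm selection, hyperparameter optimization, validation) are essentially a search loop: try each candidate, score it on held-out data, keep the best. The sketch below shows that loop in plain Python on one numeric feature. The two toy models, the search space, and the data are assumptions for illustration; real AutoML frameworks search far larger spaces with smarter strategies.

```python
def mean_model(train_x, train_y, _param):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


def knn_model(train_x, train_y, k):
    """k-nearest-neighbour regressor on a single numeric feature."""
    def predict(x):
        nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def cross_val_mse(build, param, xs, ys, folds=3):
    """Mean squared error pooled over interleaved validation folds."""
    errors = []
    for f in range(folds):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % folds != f]
        valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % folds == f]
        model = build([x for x, _ in train], [y for _, y in train], param)
        errors.extend((model(x) - y) ** 2 for x, y in valid)
    return sum(errors) / len(errors)


# Search space: (name, builder, hyperparameter values to try).
search_space = [
    ("mean", mean_model, [None]),
    ("knn", knn_model, [1, 3, 5]),
]

# Toy data, roughly y = 2x with a little noise.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9, 12.2, 13.8, 16.1]

results = [
    (cross_val_mse(build, p, xs, ys), name, p)
    for name, build, params in search_space
    for p in params
]
best_score, best_name, best_param = min(results, key=lambda r: r[0])
print(best_name, best_param, round(best_score, 3))
```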
Targets of AutoML

1) Automated data preparation and ingestion (from raw data and miscellaneous
formats)
    ● Automated column type detection; e.g., boolean, continuous, or text
    ● Automated column intent detection; e.g., target/label
    ● Automated task detection; e.g., binary classification, regression, clustering.
2) Automated feature engineering
    ● Feature selection
    ● Feature extraction
    ● Detection and handling of skewed data and/or missing values
3) Automated model selection
4) Hyperparameter optimization of the learning algorithm and featurization
5) Automated selection of evaluation metrics / validation procedures
6) Automated analysis of results obtained
7) User interfaces and visualizations for automated machine learning
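
Target 1 is the easiest to picture in code. Below is a minimal sketch of column type detection and task detection, assuming every cell arrives as a raw string (as it would from a CSV file). The heuristics, the function names, and the extra "multiclass classification" outcome are illustrative assumptions, not how any particular framework does it.

```python
BOOLEAN_TOKENS = {"true", "false", "yes", "no", "0", "1"}


def detect_column_type(values):
    """Guess a column type from raw string values: boolean, continuous, or text."""
    cleaned = [v.strip().lower() for v in values if v.strip() != ""]
    if not cleaned:
        return "text"  # nothing to go on, so fall back to the most general type
    if all(v in BOOLEAN_TOKENS for v in cleaned):
        return "boolean"
    try:
        for v in cleaned:
            float(v)
        return "continuous"
    except ValueError:
        return "text"


def detect_task(target_values):
    """Guess the learning task from the target column (empty means no label)."""
    if not target_values:
        return "clustering"
    kind = detect_column_type(target_values)
    distinct = {v.strip().lower() for v in target_values}
    if kind == "boolean" or len(distinct) == 2:
        return "binary classification"
    if kind == "continuous":
        return "regression"
    return "multiclass classification"


# Hypothetical raw table, one list of strings per column.
columns = {
    "age": ["34", "51", "", "29"],
    "subscribed": ["yes", "no", "yes", "no"],
    "comment": ["great", "too slow", "ok", ""],
    "churned": ["1", "0", "0", "1"],
}

types = {name: detect_column_type(vals) for name, vals in columns.items()}
task = detect_task(columns["churned"])
print(types, task)
```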

Advantages of AutoML

● Increases productivity by automating repetitive tasks. This enables a data scientist to focus more on the problem rather
than the models.
● Automating the ML pipeline also helps to avoid errors that might creep in manually.
● Ultimately, AutoML is a step towards democratizing machine learning by making the power of ML accessible to everybody.
AutoML Frameworks

MLBox

MLBox is a powerful automated machine learning Python library.
According to the official documentation, this library provides the
following features:

● Fast reading and distributed data preprocessing/cleaning/formatting.
● Highly robust feature selection, leak detection, and accurate
hyperparameter optimization.
● State-of-the-art predictive models for classification and regression
(Deep Learning, Stacking, LightGBM, ...).
● Prediction with model interpretation.
● It has already been tested on Kaggle and performs well.

Compatibilities:

● Operating systems: Linux, macOS & Windows.
● Python versions: 3.5 - 3.7, 64-bit only (32-bit
Python is not supported).
Auto-Sklearn

● Auto-Sklearn is an automated machine learning package
built on top of Scikit-learn.
● Auto-sklearn frees a machine learning user from
algorithm selection and hyperparameter tuning.
● It includes feature engineering methods such as one-hot
encoding, numeric feature standardization, PCA,
and more.
● Auto-sklearn performs well on small and medium-sized
datasets, but it cannot be applied to modern deep
learning systems that yield state-of-the-art performance
on large datasets.
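
Two of the feature engineering methods named above, one-hot encoding and numeric feature standardization, are simple enough to sketch by hand. This is a plain-Python illustration of what the transformations do, not Auto-sklearn's implementation; the function names and sample values are assumptions.

```python
import statistics


def one_hot(values):
    """Encode a categorical column as one 0/1 column per category (categories sorted)."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return rows, categories


def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) standard deviation.

    Assumes the column is not constant, otherwise the standard deviation is zero.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


encoded, categories = one_hot(["red", "green", "red", "blue"])
scaled = standardize([10.0, 20.0, 30.0])
print(categories, encoded)
print(scaled)
```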

Compatibilities:

● Operating systems: Linux
● Python (>=3.5)
● C++ compiler (with C++11 support)
● SWIG (version 3.0 or later)
Tree-Based Pipeline Optimization Tool (TPOT)

● TPOT is a Python automated machine learning tool
that optimizes machine learning pipelines using
genetic programming.
● TPOT extends the Scikit-learn framework but with
its own regressor and classifier methods. TPOT is
built on top of scikit-learn, so all of the code it
generates should look familiar... if you're familiar
with scikit-learn.
● TPOT works by exploring thousands of possible
pipelines and finding the best one for your data, so
it can take a while to run on large datasets.
● TPOT cannot automatically process natural
language inputs. Additionally, it is not able to
process categorical strings, which must be
integer-encoded before being passed in as data.
● TPOT is built on top of several existing Python
libraries, including:
NumPy, SciPy, scikit-learn, DEAP, update_checker,
tqdm, stopit, pandas, joblib
● We also strongly recommend that you use
Python 3 over Python 2 if you're given the choice.
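
The integer-encoding requirement for categorical strings mentioned above is a one-time preprocessing step. A minimal sketch in plain Python follows; the function name and the sample column are hypothetical, and in practice scikit-learn's `LabelEncoder` or `OrdinalEncoder` would typically do this job.

```python
def integer_encode(column):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in column:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


cities = ["Chennai", "Mumbai", "Chennai", "Delhi"]
encoded, codes = integer_encode(cities)
print(encoded, codes)
```

Keep the `codes` mapping so that the same encoding can be applied to new data at prediction time. Note that the integers imply an ordering that the original categories do not have, which can mislead some models.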
H2O AutoML

● H2O is a fully open source, distributed in-memory
machine learning platform from the
company H2O.ai.
● With support for both R and Python, H2O supports
the most widely used statistical & machine learning
algorithms, including gradient boosted machines,
generalized linear models, deep learning models,
and more.
● H2O includes an automatic machine learning
module that uses its own algorithms to build a
pipeline. It performs an exhaustive search over its
feature engineering methods and model
hyperparameters to optimize its pipelines.
● H2O automates some of the most difficult data
science and machine learning workflows, such as
feature engineering, model validation, model
tuning, model selection and model deployment.
● In addition to this, it also offers automatic
visualizations and machine learning interpretability
(MLI).
If feature engineering, model building, and
model fine-tuning are all automated, then what is
left for the data science expert?
“The need of the hour today is marrying academic
elegance with business domain knowledge. It is
the time for bilingual people who speak the
business lingo and have sound data science
concepts”

Great Learning’s Dr PK Vishawanathan in Cypher 2019


Inspiration
● https://in.pycon.org/cfp/2019/proposals/machine-learning-bias~e1Aje/
● https://towardsdatascience.com/interpretable-machine-learning-1dec0f2f3e6b
● https://heartbeat.fritz.ai/automl-the-next-wave-of-machine-learning-5494baac615f
● https://arxiv.org/pdf/1808.06492v1.pdf
● https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
● https://automl.github.io/auto-sklearn/master/#
● https://towardsdatascience.com/a-tutorial-on-fairness-in-machine-learning-3ff8ba1040cb
Thank You

By
Manikandan
Gmail - nmani1191@gmail.com
LinkedIn - www.linkedin.com/in/manikandan1191
