Choosing an Algorithm

1. Size of the training data


 Small training set with a large number of features: choose algorithms with high bias and low variance,
e.g. linear regression, Naïve Bayes, or linear SVM.

2. Accuracy and/or Interpretability of the output


 Restrictive models -> highly interpretable, e.g. linear regression
a. It is easy to understand how an individual predictor is associated with the response.
 Flexible models -> higher accuracy at the cost of lower interpretability, e.g. k-NN
 If inference is the goal, restrictive models are better because they are much more
interpretable. Flexible models are better if higher accuracy is the goal.
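
To make the interpretability point concrete, here is a minimal sketch of a one-predictor least-squares fit in plain Python. The toy data and the name `fit_line` are made up for illustration. The fitted slope reads directly as "change in response per unit change in predictor", a summary that a flexible method such as k-NN does not provide.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance sums (the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: hours studied vs. test score.
hours = [1, 2, 3, 4]
score = [3, 5, 7, 9]

slope, intercept = fit_line(hours, score)
print(f"each extra hour is associated with {slope:.1f} more points")
```

With several predictors the same reading applies to each coefficient, holding the other predictors fixed.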

3. Speed or Training time


 Higher accuracy and larger training sets usually mean longer training time.
 Naïve Bayes, linear regression, and logistic regression are easy to implement and quick to run.
 SVMs (which require parameter tuning), neural networks (which can be slow to converge),
and random forests need much more time to train.

4. Linearity
 Linear algorithms assume that the data follow a straight-line trend (or are linearly
separable) and work best when this is the case, e.g. linear regression, logistic regression,
and linear support vector machines.
 Non-linear data call for algorithms that can handle high-dimensional and complex data
structures, e.g. kernel SVM, random forest, neural nets.
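
A small sketch of why linearity matters, using the XOR pattern as an assumed toy dataset. The brute-force search over a weight grid is only an illustration, not how an SVM or any real learner is trained, and the hand-picked product feature `a * b` merely stands in for the kind of non-linear mapping a kernel provides.

```python
from itertools import product

# XOR: the label is 1 only when exactly one input is 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]


def separable(points, labels, grid):
    """True if some rule 'w . x + b > 0' with weights from the grid fits every point."""
    dim = len(points[0])
    for *w, b in product(grid, repeat=dim + 1):
        preds = [int(sum(wi * xi for wi, xi in zip(w, p)) + b > 0) for p in points]
        if preds == labels:
            return True
    return False


grid = [v / 2 for v in range(-6, 7)]  # candidate weights from -3.0 to 3.0

# No straight line separates XOR in the original two features.
linear_ok = separable(X, y, grid)

# Adding the product feature makes the same points linearly separable.
X_mapped = [(a, b, a * b) for a, b in X]
mapped_ok = separable(X_mapped, y, grid)

print(linear_ok, mapped_ok)
```

The grid result for the raw features agrees with the known fact that XOR is not linearly separable; the mapped version succeeds with, for example, weights (1, 1, -2) and bias -0.5.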

5. Number of features
 The dataset may have a large number of features, not all of them relevant or
significant. E.g. in text analytics, the number of features can be very large compared to
the number of data points.
 SVM is better suited to data with a large feature space and fewer observations.
PCA and feature selection techniques should be used to reduce dimensionality and
select important features.
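
One simple form of feature selection can be sketched as a correlation filter: rank each feature by the absolute Pearson correlation with the target and keep the top k. This is only one of many possible techniques (it is not PCA, and it only detects linear relationships); the names `pearson` and `select_features` and the toy data are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def select_features(rows, target, k):
    """Return the indices of the k features most correlated with the target."""
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, target)), j))
    # Highest score first; ties broken by feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Hypothetical data: feature 0 tracks the target, feature 1 is constant,
# feature 2 is weakly related.
rows = [(1, 5, 2), (2, 5, 1), (3, 5, 2), (4, 5, 1)]
target = [1.1, 1.9, 3.2, 3.9]

print(select_features(rows, target, 2))
```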
Confusion Matrix
Smog prediction system: keeping false negatives low matters more than keeping false positives low.
A false negative means not warning about a smog day when it is in fact a high-smog day,
leading to health issues for members of the public who were unable to take precautions. A false positive means
the public takes precautionary measures when they didn’t need to.
Sensitivity / TPR = TP / (TP + FN): the ability of a test to correctly identify patients with the disease, i.e. the
extent to which actual positives are not overlooked.
High sensitivity is very important when detecting a very serious type of infection.
Specificity / TNR = TN / (TN + FP): the ability to correctly identify patients without the disease.
High sensitivity, low specificity: many of those without the disease are told they may have it.
False positive: healthy people incorrectly identified as sick -> false alarm
 Airport security: ordinary items such as keys or coins get mistaken for weapons
 Quality control: a good-quality item gets rejected
False negative: sick people incorrectly identified as healthy -> missed detection
 Quality control: a poor-quality item gets accepted
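PRECISION = TP / (TP + FP): how confident you are that your predicted positives are true positives. Choose it if a FN is more tolerable than a FP, e.g. a spam filter, where spam in the Inbox is preferable to regular email in Spam.
RECALL = TP / (TP + FN): how sure you are that you are not missing any positives (the same quantity as sensitivity). Choose it if a FP is more tolerable than a FN.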
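
The metrics in this section can be computed from the four confusion-matrix counts. Below is a minimal sketch in plain Python; the label vectors are invented (1 = positive, e.g. a high-smog day, 0 = negative), and the function names `confusion_counts` and `rates` are illustrative only.

```python
def confusion_counts(actual, predicted):
    """Count (TP, FP, TN, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Derive the rates; assumes every denominator is non-zero."""
    return {
        "sensitivity": tp / (tp + fn),  # same as recall / TPR
        "specificity": tn / (tn + fp),  # TNR
        "precision": tp / (tp + fp),
    }


# Hypothetical ground truth and predictions for ten days.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
metrics = rates(tp, fp, tn, fn)
print(tp, fp, tn, fn)
print(metrics)
```

In this toy run there is one false negative (a missed positive) and two false positives (false alarms), so recall comes out higher than precision.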
