4. Linearity
Linear algorithms assume that data trends follow a straight line and work best when this is the case. E.g. linear regression, logistic regression, and linear support vector machines.
Non-linear data needs algorithms that can handle high-dimensional and complex data structures. E.g. kernel SVM, random forests, neural nets.
5. Number of features
The dataset may have a large number of features, not all of which are relevant or significant. E.g. in text analytics, the number of features can be very large compared to the number of data points.
SVM is better suited to data with a large feature space and fewer observations.
PCA and feature-selection techniques should be used to reduce dimensionality and select important features.
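The dimensionality reduction mentioned above can be sketched with a minimal PCA via SVD; the data shapes and random seed here are illustrative, not from the notes:

```python
import numpy as np

def pca_reduce(X, n_components):
    # Center the data, then project onto the top principal
    # components obtained from the SVD of the centered matrix.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))   # 100 observations, 50 features
X_reduced = pca_reduce(X, 5)     # keep the 5 strongest directions
print(X_reduced.shape)           # (100, 5)
```

In practice a library implementation (e.g. scikit-learn's `PCA`) would be used, but the projection step is the same idea.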
Confusion Matrix
Smog prediction system: minimizing false negatives matters more than minimizing false positives
{ A false negative means failing to warn about a day that is in fact a high-smog day, causing health issues for the public, who were unable to take precautions. A false positive means the public takes precautionary measures when they didn't need to. }
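One common way to favor fewer false negatives, as the smog example calls for, is lowering the decision threshold on a classifier's score. A small sketch with hypothetical scores and labels (none of these numbers come from the notes):

```python
def classify(scores, threshold):
    # Predict "high smog" (1) when the model score reaches the threshold.
    return [1 if s >= threshold else 0 for s in scores]

def false_negatives(y_true, y_pred):
    # Count actual positives that the classifier missed.
    return sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Hypothetical model scores and true labels (1 = high-smog day).
scores = [0.9, 0.6, 0.4, 0.3, 0.2, 0.1]
y_true = [1,   1,   1,   0,   0,   0]

# The default 0.5 threshold misses the smog day scored 0.4.
fn_default = false_negatives(y_true, classify(scores, 0.5))   # 1
# Lowering the threshold trades extra false alarms for fewer misses.
fn_low = false_negatives(y_true, classify(scores, 0.35))      # 0
```

The trade-off is exactly the one in the note: more precautionary warnings in exchange for fewer missed high-smog days.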
Sensitivity / TPR: the ability of a test to correctly identify patients with the disease, i.e. the
extent to which actual positives are not overlooked.
High sensitivity is very important when detecting a very serious type of infection
Specificity / TNR: Ability to correctly identify patients without the disease
High sensitivity, low specificity: many of those without the disease are told they may have it
False positive: a healthy person incorrectly identified as sick -> false alarm
Airport security: ordinary items such as keys or coins get mistaken for weapons
Quality control: a good-quality item gets rejected
False negative: a sick person incorrectly identified as healthy -> bad
Quality control: a poor-quality item gets accepted
PRECISION: confidence in your positive predictions, i.e. how many predicted positives are truly positive. (Prefer when a false positive is costly: better some spam in the Inbox than regular email in Spam.)
RECALL: how sure you are that you are not missing any positives. (Choose when a false positive is better than a false negative.)
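The metrics above can all be computed directly from confusion-matrix counts; the screening numbers below are made up for illustration:

```python
def metrics(tp, fp, fn, tn):
    # Sensitivity (recall / TPR), specificity (TNR), and precision
    # from the four confusion-matrix cells.
    sensitivity = tp / (tp + fn)   # positives correctly caught
    specificity = tn / (tn + fp)   # negatives correctly cleared
    precision = tp / (tp + fp)     # predicted positives that are real
    return sensitivity, specificity, precision

# Hypothetical disease screening: 90 cases caught, 10 cases missed,
# 20 false alarms, 880 healthy people correctly cleared.
sens, spec, prec = metrics(tp=90, fp=20, fn=10, tn=880)
print(round(sens, 2), round(spec, 2), round(prec, 2))  # 0.9 0.98 0.82
```

Note how the 10 false negatives hurt sensitivity while the 20 false positives hurt precision and specificity, matching the trade-offs described above.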