
3 Vs of data: Volume, Velocity (datasets produced every minute), Variety (structured, textual, video,
image, audio, XML, JSON, sensor...)

induction: process of creating general theories from observed data

induction: from particular to general: Bottom-up induction

deduction: from general to particular: Top-down deduction

KDD: knowledge discovery in databases: extraction of useful knowledge from large databases

Data mining is a step of knowledge discovery in databases

KDD PROCESS: SELECTION -> PREPROCESSING -> MINING -> POSTPROCESSING

Data Mining tasks:

- Predictive tasks: predict the value of a particular attribute based on the values of other attributes:
Classification, regression

- Descriptive tasks: inducing patterns that summarize the underlying relationships in data

+ Clustering: subdividing a set of objects into subsets

+ Association analysis: inducing relationships among attributes

OVERFITTING:

- training data is noisy: some instances are erroneously classified

- a hypothesis that exactly fits the training data may be wrong and generalize poorly

- Overfitting occurs when the model is too tailored to the training data -> it reflects the contingent
properties of that data rather than its structural properties

h overfits the training data if there exists another hypothesis h' such that:

+ h has a smaller error than h' over the training data

+ h' has a smaller error than h over the unseen data
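
A minimal sketch of this definition on toy data, in plain Python. Everything here is an illustrative assumption, not part of the notes: the data, the "true" rule (label 1 when x >= 5), the single noisy training instance, and the two hypotheses. h memorizes the training set (nearest training instance), h' is a single threshold.

```python
# Toy 1-D data. Assumed true concept: label is 1 when x >= 5.
# The training instance (7, 0) is noisy, i.e. erroneously classified.
train = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
         (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
# Unseen instances, labeled by the assumed true concept.
unseen = [(0.5, 0), (2.5, 0), (4.2, 0), (5.5, 1),
          (6.8, 1), (7.2, 1), (8.6, 1), (9.5, 1)]


def h(x):
    """Memorizing hypothesis: copy the label of the nearest training instance."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label


def h_prime(x):
    """Simpler hypothesis: a single threshold at x = 5."""
    return 1 if x >= 5 else 0


def error(hypothesis, data):
    """Fraction of instances in data that the hypothesis misclassifies."""
    return sum(hypothesis(x) != y for x, y in data) / len(data)


print("train : h =", error(h, train), " h' =", error(h_prime, train))
print("unseen: h =", error(h, unseen), " h' =", error(h_prime, unseen))
```

On this toy data h fits the training set exactly, including the noisy instance, and then misclassifies the unseen points near x = 7, while h' makes one training error and none on the unseen points. So h overfits in the sense defined above.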

METRICS FOR PERFORMANCE EVALUATION (ERROR AND ACCURACY):

Accuracy = (TP + TN) / N, where N = TP + TN + FN + FP (TP/TN: true positives/negatives, FP/FN: false
positives/negatives)

Error = (FP + FN) / N = 1 - Accuracy

Precision = TP / (TP + FP)
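
The three formulas above can be sketched in plain Python as follows. The function names, the sample labels, and the convention of returning 0.0 for precision when nothing is predicted positive are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def evaluate(y_true, y_pred):
    """Return accuracy, error and precision as defined in the notes."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    error = (fp + fn) / n
    # Precision is undefined when there are no positive predictions;
    # returning 0.0 in that case is an assumed convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"accuracy": accuracy, "error": error, "precision": precision}


# Hypothetical labels: 3 TP, 1 FN, 1 FP, 5 TN.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(evaluate(y_true, y_pred))
```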

ENTROPY:

can be used as a measure of the degree of disorder of the training examples

all examples in the same class: entropy is zero

examples evenly distributed among the classes: entropy is maximum
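
The notes do not give a formula, so the sketch below assumes the standard Shannon entropy, H = -sum(p_i * log2(p_i)) over the class proportions p_i. It checks the two extreme cases above; the function name and the sample labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1 / p) is the same as -p * log2(p), with p = c / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(entropy(["yes"] * 6))                # all examples in the same class
print(entropy(["yes"] * 3 + ["no"] * 3))   # two classes, evenly distributed
print(entropy(["yes"] * 5 + ["no"] * 1))   # somewhere in between
```

With two classes the maximum is 1 bit; with k evenly distributed classes it would be log2(k).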

INFORMATION GAIN

the expected reduction in entropy caused by partitioning the examples according to an attribute A

the more discriminating A is, the higher the IG
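
A minimal sketch of information gain, assuming the usual definition IG(S, A) = H(S) - sum over values v of A of (|S_v| / |S|) * H(S_v), with H the Shannon entropy. The tiny dataset and the attribute names ("outlook", "windy", "play") are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c)
               for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    before = entropy([row[target] for row in rows])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


# Hypothetical examples: "outlook" determines "play", "windy" tells nothing.
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]
print(information_gain(rows, "outlook", "play"))  # fully discriminating attribute
print(information_gain(rows, "windy", "play"))    # non-discriminating attribute
```

Here the discriminating attribute recovers the full 1 bit of entropy, while the non-discriminating one yields a gain of 0.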

PERFORMANCE OF LINEAR REGRESSION MODEL: mean_absolute_error (MAE) and mean_squared_error (MSE)

PERFORMANCE OF LOGISTIC REGRESSION MODEL: accuracy, precision, recall, F1 score
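
Recall and F1 score are not defined in these notes. The sketch below assumes the standard definitions, Recall = TP / (TP + FN) and F1 = 2 * Precision * Recall / (Precision + Recall). The function name, the counts, and the 0.0 fallback for empty denominators are illustrative assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    # Returning 0.0 when a denominator is zero is an assumed convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, f1)
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high.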

the loss: difference between actual and predicted values -> distance measure

loss function for regression: mean_absolute_error, mean_squared_error
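
As a sketch, both regression losses can be written in a few lines of plain Python. The function names mirror the ones used in the notes, but these are illustrative re-implementations rather than a library API, and the sample values are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
print(mean_absolute_error(y_true, y_pred))  # 1.0
print(mean_squared_error(y_true, y_pred))   # 1.5
```

MSE squares each difference, so the single error of 2 weighs more heavily in it than in MAE.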

SUPERVISED LEARNING: the learning algorithm infers a model from a labeled training dataset

UNSUPERVISED LEARNING: modeling the underlying or hidden structure in unlabeled data; there is only
input data, with no corresponding output variables - e.g. clustering (ex: self-organizing maps)
