Data Reduction:

Simplify the dataset to focus on the variables or constructs that carry the most meaning, separating signal from noise. Here we are generally talking about reducing variables or fields (as opposed to observations).
Possible reasons:

- Storage (hard drive)
- Memory (RAM)
- Time
- Reduce noise / distractions
- Focus on patterns
- Easier to interpret

The analogy is projecting a shadow: taking data from a high-dimensional space (each variable in the dataset is a dimension) and projecting it onto a lower-dimensional space. Think of taking a three-dimensional object and casting its shadow onto a two-dimensional surface while still being able to tell what it is. One common way of doing this is PCA (Principal Component Analysis); a sketch follows the tool list below. Tools that may be used:

- R
- Python
- Orange
- Rapid Miner
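
A minimal PCA sketch in Python with scikit-learn (the synthetic data, the choice of two components, and the variable names are illustrative assumptions, not from the notes):

    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic data: 100 observations of 5 variables (made up for illustration)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))

    pca = PCA(n_components=2)           # project the "shadow" onto 2 dimensions
    X_reduced = pca.fit_transform(X)    # shape: (100, 2)

    # How much of the original variance each retained component explains
    print(pca.explained_variance_ratio_)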

Clustering:

The idea is to group the entire set of observations or cases so that “like goes with like”. This is a grouping of convenience rather than some natural/universal grouping; we group the cases so that the grouping accomplishes a specific purpose. For example, in marketing, similar customers are grouped together for offers. Clusters are pragmatic groupings that serve a particular purpose.

- Distance between points:
  o Measure the distance from every point to every other point
  o Cons: applicable only to convex clusters; very slow for big data
- Distance from a centroid:
  o K-means (see the sketch after this list)
- Density of data
- Distribution models
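
A minimal k-means sketch in Python with scikit-learn (the two synthetic blobs and the parameter choices are assumptions for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    # Two loose blobs of 2-D points (made-up data)
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(50, 2)),
                   rng.normal(5, 1, size=(50, 2))])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.cluster_centers_)    # one centroid per cluster
    print(kmeans.labels_[:10])        # cluster assignment for each case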

Classification:

Choosing the right bucket for new cases. Examples:

- Spam filters
- Fraud detection
- Genetic testing

Classification complements clustering: clustering creates the buckets, and classification puts new cases into them. Algorithms used for classification (a k-NN sketch follows the list):

- K-nearest neighbors (k-NN)
- Naïve Bayes
- Decision trees
- Random forests
- Support vector machines (SVM)
- Artificial neural networks (ANN)
- K-means
- Logistic regression
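
A minimal k-NN classification sketch in Python with scikit-learn (the toy points and "ham"/"spam" labels are made up to echo the spam-filter example):

    from sklearn.neighbors import KNeighborsClassifier

    # Six labeled training cases in 2-D (hypothetical feature values)
    X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
    y_train = ["ham", "ham", "ham", "spam", "spam", "spam"]

    # Each new case gets the majority label of its 3 nearest neighbors
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))   # -> ['ham' 'spam']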

Anomaly Detection:

Anomalies distort statistics, correlations, etc. We have a few ways around them (illustrated after the list):

- Delete them, while making sure this does not invalidate the analysis
- Transform the data (logs, squares, etc., to make the distribution symmetrical)
- Use robust methods that are not strongly influenced by anomalies (e.g., the median over the mean)
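
A small numeric illustration of the robust option (the numbers are made up; one extreme anomaly pulls the mean far off while the median barely moves):

    import numpy as np

    values = np.array([10, 11, 9, 10, 12, 10, 11, 500])   # 500 is the anomaly

    print(np.mean(values))     # 71.625 - dragged toward the anomaly
    print(np.median(values))   # 10.5   - stays near the bulk of the data

    # A log transform compresses the extreme value toward the rest
    print(np.log(values))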

Association Analysis:

- Powerful method of finding associations (items that go together)
- Able to get the probability of an item (or set of items) based on the presence of another item (or set of items)

This may be used on a purchasing website where associated items may be shown to customers.
Packages in R: arules, arulesViz.
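
A bare-bones Python sketch of the two core measures, support and confidence, over toy transactions (the R packages above provide full implementations; the items and numbers here are made up):

    transactions = [
        {"bread", "butter"},
        {"bread", "butter", "jam"},
        {"bread"},
        {"butter", "jam"},
    ]

    def support(itemset):
        # Fraction of transactions containing every item in the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        # Estimated P(consequent | antecedent)
        return support(antecedent | consequent) / support(antecedent)

    print(support({"bread", "butter"}))        # 0.5
    print(confidence({"bread"}, {"butter"}))   # 0.666... (2 of 3 bread baskets)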

Regression Analysis:

Use many variables to predict one. An example is least squares regression (which assumes that the errors follow a normal distribution).

Correlated predictors: multicollinearity occurs when the predictor variables are associated with each other.
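
A minimal least squares sketch in Python with scikit-learn (the two synthetic predictors and their true coefficients are assumptions for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # y depends linearly on two predictors plus a little noise
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=100)

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)   # roughly [3, -2] and 0
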
Sequence Mining:

Sequence mining is like association analysis, but here the sequence/order of events matters. Examples are recommendation engines (if a person does A and then B, they are likely to do C).
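
A hand-rolled Python sketch of the order-matters idea (the events and user logs are made up; real sequence miners such as PrefixSpan are far richer):

    # Per-user event logs (toy data); we estimate P(c follows a then b)
    sequences = [
        ["a", "b", "c"],
        ["a", "b", "d"],
        ["b", "a", "c"],   # different order, so it does not count for (a, b)
        ["a", "b", "c"],
    ]

    def follows(seq, pattern):
        # True if the events in `pattern` occur in seq in that order
        it = iter(seq)
        return all(event in it for event in pattern)

    with_ab = [s for s in sequences if follows(s, ["a", "b"])]
    with_abc = [s for s in with_ab if follows(s, ["a", "b", "c"])]
    print(len(with_abc) / len(with_ab))   # 2/3 of a-then-b users go on to c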

Text Mining:

Unlike the other types, text mining works on unstructured data (instead of rows and columns of numeric data); here we have a blob of text. E.g.:

- Assessing authorship and voice
- Sentiment analysis for social media (figuring out whether people are saying good or bad things about something; see the sketch below)
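
A toy lexicon-based sentiment sketch in Python (the word lists and example sentences are made up; real sentiment analysis uses trained models and far larger lexicons):

    POSITIVE = {"good", "great", "love", "excellent"}
    NEGATIVE = {"bad", "terrible", "hate", "awful"}

    def sentiment(text):
        # Positive-word count minus negative-word count
        words = text.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    print(sentiment("I love this product it is great"))   # 2 (positive)
    print(sentiment("terrible service I hate it"))        # -2 (negative)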