Data Analysis
Data Analysis concerns methods, procedures and strategies for exploring, organizing, and
describing data using graphs and numerical summaries. Only organized data can illuminate
reality. Only thoughtful exploration of data can defeat the lurking variable.
● Nominal: can only take on a limited set of values with no meaningful ordering in
between.
● Ordinal: can only take on a limited set of values with a meaningful ordering in between.
● Binary: can only take on two values.
Properly distinguishing between these different data elements is of key importance at the
start of the analysis, when importing the data into an analytics tool.
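As a minimal sketch of why the distinction matters, the snippet below handles one nominal, one ordinal, and one binary field differently. The records, field names, and the AGE_ORDER mapping are hypothetical and only serve as an illustration.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"marital_status": "single", "age_group": "young", "is_employed": True},
    {"marital_status": "married", "age_group": "old", "is_employed": False},
    {"marital_status": "single", "age_group": "middle", "is_employed": True},
]

# Nominal: no ordering, so only counting and equality checks are meaningful.
status_counts = Counter(r["marital_status"] for r in records)

# Ordinal: an explicit rank (assumed here) makes the ordering usable for sorting.
AGE_ORDER = {"young": 0, "middle": 1, "old": 2}
sorted_by_age = sorted(records, key=lambda r: AGE_ORDER[r["age_group"]])

# Binary: exactly two values, conveniently stored as 0/1.
employed_flags = [int(r["is_employed"]) for r in records]
```

Declaring the ordering once, at import time, avoids an analytics tool silently sorting ordinal labels alphabetically.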
Visual data exploration is a very important part of getting to know your data in an
“informal” way. It gives you some initial insight into the data, which can
then be usefully adopted throughout the modeling.
Missing Values
Missing values can occur for various reasons. For example, the information can be inapplicable.
Some analytical techniques (e.g., decision trees) can directly deal with missing values. Other
techniques need some additional preprocessing.
Data Analysis, Data Collection
Sampling and Preprocessing
Replace (impute). This implies replacing the missing value with a known value. One could
impute the missing credit bureau score with the average or median of the known values.
Delete. This is the most straightforward option and consists of deleting observations or
variables with lots of missing values. This, of course, assumes that information is missing at
random and has no meaningful interpretation and/or relationship to the target.
Keep. Missing values can be meaningful. In that case, the missingness is related to the target
and needs to be considered as a separate category.
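The three options can be sketched on a small list of scores. The values are hypothetical, None stands for a missing value, and the median is used as the imputed value (the mean would work the same way).

```python
from statistics import median

# Hypothetical credit bureau scores; None marks a missing value.
scores = [650, None, 720, 580, None, 700]

known = [s for s in scores if s is not None]

# Replace (impute): fill each missing value with the median of the known values.
fill_value = median(known)
imputed = [fill_value if s is None else s for s in scores]

# Delete: drop the observations whose score is missing.
deleted = list(known)

# Keep: record missingness as its own category through an indicator flag.
is_missing = [int(s is None) for s in scores]
```

Which option fits depends on whether the missingness carries information about the target, so the indicator flag is often kept alongside an imputed value.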
As a practical way of working, one can first start by statistically testing
whether missing information is related to the target variable (also known as the dependent
variable) using, for example, a chi-squared test.
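A minimal sketch of such a test, assuming a 2x2 table of missing/present against a binary target. The counts are made up, the statistic is Pearson's chi-squared without continuity correction, and 3.841 is the 5% critical value for one degree of freedom.

```python
def chi_squared_2x2(table):
    """Pearson chi-squared statistic for a contingency table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of missingness and target.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = value missing / present, columns = target bad / good.
table = [[30, 70],
         [50, 350]]

stat = chi_squared_2x2(table)
CRITICAL_5PCT_DF1 = 3.841  # chi-squared critical value, alpha = 0.05, 1 degree of freedom
related = stat > CRITICAL_5PCT_DF1
```

If the statistic exceeds the critical value, missingness appears related to the target, which argues for the Keep option rather than deletion.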
Both are univariate outliers in the sense that they are outlying on one dimension.
However, outliers can be hidden in unidimensional views of the data. Multivariate outliers are
observations that are outlying in multiple dimensions. Two important steps in
dealing with outliers are detection and treatment. A first obvious check for outliers is to
calculate the minimum and maximum values for each data element.
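This first check can be sketched as follows. The column names and values are hypothetical, and deciding whether a minimum or maximum is implausible remains a judgment call for the analyst.

```python
# Hypothetical data; column names and values are illustrative only.
data = {
    "age": [25, 31, 47, 52, 300],
    "income": [1800, 2100, 2500, -50, 3200],
}

# Minimum and maximum per data element.
ranges = {name: (min(values), max(values)) for name, values in data.items()}

for name, (lowest, highest) in ranges.items():
    print(f"{name}: min={lowest}, max={highest}")
```

Here an age of 300 and an income of -50 stand out immediately. Note that this only reveals univariate outliers; multivariate outliers need a different check.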
Sampling
The aim of sampling is to take a subset of past data and use that to build an analytical
model.
Systematic sampling: a technique that involves selecting every kth item in the
population.
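Selecting every kth item can be sketched with a slice. The function name systematic_sample and the random starting offset are assumptions made for illustration; the text itself only specifies the step k.

```python
import random

def systematic_sample(population, k, start=None):
    """Return every kth item of population, beginning at index start."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    if start is None:
        # Assumed convention: pick a random starting point within the first k items.
        start = random.randrange(k)
    if not 0 <= start < k:
        raise ValueError("start must satisfy 0 <= start < k")
    return population[start::k]

population = list(range(1, 21))  # hypothetical population of 20 items
sample = systematic_sample(population, k=5, start=2)
print(sample)  # [3, 8, 13, 18]
```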
Transformation Defined:
- The application of a deterministic (non-random) function to every point in a
data set.
- Each data point x_i is replaced with the transformed value y_i = f(x_i), where f is a function.
Furthermore, the rule of thumb is to use the transformation when a number of proportions are
close to zero or close to one. The transformation will “stretch out” the proportions
that are close to 0 and 1, while it will “compress” the proportions close to 0.5.
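The text does not name the transformation. The sketch below assumes it is the logit, log(p / (1 - p)), which behaves as described; the proportions are hypothetical.

```python
import math

def logit(p):
    """Logit transform of a proportion strictly between 0 and 1."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1 - p))

# Hypothetical proportions, some near the boundaries and some near 0.5.
proportions = [0.01, 0.05, 0.45, 0.50, 0.55, 0.95, 0.99]
transformed = [logit(p) for p in proportions]

# Spacing after the transform: a gap near 0 versus a similar-sized gap near 0.5.
gap_near_zero = logit(0.05) - logit(0.01)
gap_near_half = logit(0.50) - logit(0.45)
```

The gap near 0 becomes several times larger than the gap near 0.5, so values close to the boundaries are spread out relative to values in the middle.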