
Data Analysis, Data Collection

Sampling and Preprocessing

Data Analysis
Data analysis concerns methods, procedures, and strategies for exploring, organizing, and
describing data using graphs and numerical summaries. Only organized data can illuminate
reality, and only thoughtful exploration of data can expose lurking variables.

Types of Data Sources

Primary data - one of the most important sources of data.
- collected directly from first-hand experience.
- consists of original or novel material.

Secondary data - second-hand information.
- already collected or surveyed by someone else.

Types of Data Elements

It is imperative to appropriately consider the different types of data elements at the


start of the analysis. The following types of data elements can be considered

Continuous data - is a type of data elements that defined on an interval scale


- can be limited or unlimited.

Categorical data - yield nonnumerical information


- are examples of qualitative data.

● Nominal: can only take on a limited set of values with no meaningful ordering in
between.
● Ordinal: can only take on a limited set of values with a meaningful ordering in between.
● Binary: can only take on two values.

Suitably distinguishing between these different data elements is of key importance when
importing the data into an analytics tool at the start of the analysis.
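As a brief sketch of this step (assuming pandas as the analytics tool; the data and column names are purely illustrative), the nominal/ordinal distinction can be declared explicitly at import time:

```python
import pandas as pd

# Hypothetical example data; column names are illustrative assumptions.
df = pd.DataFrame({
    "income": [30000.0, 45000.5, 52000.0],              # continuous
    "marital_status": ["single", "married", "single"],  # nominal
    "credit_rating": ["low", "high", "medium"],         # ordinal
    "defaulted": [0, 1, 0],                             # binary
})

# Declare the nominal variable as an unordered categorical.
df["marital_status"] = df["marital_status"].astype("category")

# Declare the ordinal variable with an explicit, meaningful ordering.
df["credit_rating"] = pd.Categorical(
    df["credit_rating"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)
```

Declaring the ordering up front lets later steps (sorting, comparisons, encoding) respect it automatically.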

Visual Data Exploration and Exploratory Statistical Analysis

Visual data exploration is a very significant part of getting to know your data in an
“informal” way. It allows you to gain some initial insight into the data, which can
then be usefully adopted throughout the modeling stage.

Missing Values

Missing values can occur for various reasons; for example, the information may simply be inapplicable.

Some analytical techniques (e.g., decision trees) can directly deal with missing values. Other
techniques need some additional preprocessing.

Replace (impute). This implies replacing the missing value with a known value. One could
impute the missing credit bureau score with the average or median of the known values.

Delete. This is the most straightforward option and consists of deleting observations or
variables with lots of missing values. This, of course, assumes that the information is missing
at random and has no meaningful interpretation and/or relationship to the target.

Keep. Missing values can be meaningful: the fact that a value is missing may itself be
related to the target, in which case missingness should be treated as a separate category.

As a practical way of working, one can first statistically test whether the missing
information is related to the target variable (also known as the dependent variable)
using, for example, a chi-squared test.
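A minimal sketch of the three strategies and the chi-squared check, assuming pandas/SciPy and hypothetical data (a credit bureau score with missing values and a binary target):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data; values and column names are illustrative assumptions.
df = pd.DataFrame({
    "bureau_score": [620.0, np.nan, 710.0, np.nan, 680.0, 590.0],
    "target":       [1,     1,      0,     1,      0,     0],
})

# Replace (impute): fill missing scores with the median of the known values.
imputed = df["bureau_score"].fillna(df["bureau_score"].median())

# Delete: drop observations with missing values.
deleted = df.dropna(subset=["bureau_score"])

# Keep: encode missingness as its own indicator/category.
df["score_missing"] = df["bureau_score"].isna().astype(int)

# Chi-squared test of independence between missingness and the target.
table = pd.crosstab(df["score_missing"], df["target"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)
```

A small p-value would suggest missingness is related to the target, favoring the "keep" strategy over deletion.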

Outlier Detection and Treatment


Outliers are extreme observations that are very dissimilar to the rest of the population.
Essentially, two types of outliers can be considered:
1. Valid observations (e.g., salary of company president is Php 1 million)
2. Invalid observations (e.g., age is 300 years)

Both are univariate outliers in the sense that they are outlying on one dimension.
However, outliers can be hidden in unidimensional views of the data. Multivariate outliers are
observations that are outlying in multiple dimensions. Two important steps in
dealing with outliers are detection and treatment. A first obvious check for outliers is to
calculate the minimum and maximum values for each data element.
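The min/max check can be sketched as follows (pandas assumed; the data and the plausibility bound are illustrative assumptions, not from the source):

```python
import pandas as pd

# Hypothetical data: one valid extreme salary and one invalid age.
df = pd.DataFrame({
    "salary": [25000, 31000, 28000, 1_000_000, 30000],
    "age":    [34,    41,    29,    52,        300],
})

# First obvious check: minimum and maximum of each data element.
summary = df.agg(["min", "max"])
print(summary)

# Comparing the extremes against plausible bounds flags invalid observations;
# the bound of 120 years is an illustrative assumption.
invalid_age = df["age"] > 120
print(df.loc[invalid_age])
```

Here the extreme salary survives the check (a valid observation), while the age of 300 is flagged as invalid.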

Sampling
The aim of sampling is to take a subset of past data and use that to build an analytical
model.

Simple random sampling - a technique for selecting a sample of items from a
population in such a way that every member of the population has an equal
chance of being chosen.

Systematic sampling - a technique that involves the selection of every kth item in the
population.

Stratified random sampling - employed by first dividing the population into
subpopulations called strata: homogeneous collections of items (individuals,
respondents, things). Then a simple random sample is taken within each
stratum, and these samples are combined to form the overall sample. In stratified
sampling, the strata are frequently sampled in proportion to their size, which is
called proportional allocation.
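The three techniques above can be sketched with pandas/NumPy on a hypothetical population (region names and sizes are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical population of 1,000 customers in three regions.
population = pd.DataFrame({
    "customer_id": range(1000),
    "region": rng.choice(["north", "central", "south"],
                         size=1000, p=[0.5, 0.3, 0.2]),
})

# Simple random sampling: every member has an equal chance of being chosen.
simple = population.sample(n=100, random_state=42)

# Systematic sampling: take every kth item.
k = len(population) // 100
systematic = population.iloc[::k]

# Stratified random sampling with proportional allocation: a simple random
# sample within each stratum (region), sized in proportion to the stratum.
stratified = population.groupby("region").sample(frac=0.1, random_state=42)

print(len(simple), len(systematic), len(stratified))
```

With `frac=0.1` per stratum, larger strata contribute proportionally more observations, which is exactly proportional allocation.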
Data Transformation

Transformation Defined:
- the application of a deterministic (non-random) function to every point in a
data set.
- each data point x is replaced by its transformed value y = f(x), where f is the
chosen function.

Reasons for transforming data:
- to satisfy the assumptions of normality, linearity, homogeneity of variance,
etc.
- to make units of variables comparable when they are measured on different scales.
- to handle variables with different ranges of values, e.g., trust in government is
measured on a 1-10 scale while political efficacy is measured on a 1-4 scale.

Transforming Percents, Proportions, and Probabilities

Two transforms are commonly used for transforming percents, proportions,
and probabilities. A percentage should first be converted into a proportion by
dividing it by 100. These transformations are appropriate only for
percentages that range from 0 to 100.

Furthermore, a rule of thumb is to apply such a transformation when a number of the
proportions are close to zero or close to one. The transformation will “stretch out” the
proportions that are close to 0 and 1 while it will “compress” the proportions close to 0.5.
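The source does not name the two transforms. The logit and arcsine-square-root transforms are common choices whose behavior matches the description (stretching proportions near 0 and 1, compressing those near 0.5), so they are used below purely as an assumption:

```python
import numpy as np

def percent_to_proportion(percent):
    """Convert a percentage in [0, 100] to a proportion in [0, 1]."""
    return np.asarray(percent, dtype=float) / 100.0

def logit(p):
    """Logit transform (assumed example); undefined at exactly 0 or 1."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def arcsine_sqrt(p):
    """Arcsine-square-root transform (assumed example)."""
    return np.arcsin(np.sqrt(np.asarray(p, dtype=float)))

props = percent_to_proportion([5, 50, 95])
# Near 0 and 1, equal-sized steps in the proportion map to larger steps after
# the transform than the same-sized steps near 0.5 ("stretching" the tails).
print(logit(props))
print(arcsine_sqrt(props))
```

Note that both transforms expect proportions in (0, 1), which is why the percentage is divided by 100 first.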
