
What is Data Preparation?

Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It
is an important step that often involves reformatting data, correcting errors in data,
and combining data sets to enrich the data.
Data preparation is often a lengthy undertaking for data professionals or business users, but it is
essential as a prerequisite to put data in context in order to turn it into insights and eliminate bias
resulting from poor data quality.
For example, the data preparation process usually includes standardizing data formats, enriching
source data, and/or removing outliers.

Benefits of Data Preparation + The Cloud

76% of data scientists say that data preparation is the worst part of their job, but efficient, accurate
business decisions can only be made with clean data. Data preparation helps:

1. Fix errors quickly — Data preparation helps catch errors before processing. After data has been
removed from its original source, these errors become more difficult to understand and correct.
2. Produce top-quality data — Cleaning and reformatting datasets ensures that all data used in
analysis will be high quality.
3. Make better business decisions — Higher quality data that can be processed and analyzed more
quickly and efficiently leads to more timely, efficient and high-quality business decisions.

Additionally, as data and data processes move to the cloud, data preparation moves with it for even
greater benefits, such as:

1. Superior scalability — Cloud data preparation can grow at the pace of the business. Enterprises
don't have to worry about the underlying infrastructure or try to anticipate its evolution.
2. Future proof — Cloud data preparation upgrades automatically so that new capabilities or
problem fixes can be turned on as soon as they are released. This allows organizations to stay
ahead of the innovation curve without delays and added costs.
3. Accelerated data usage and collaboration — Doing data prep in the cloud means it is always
on, doesn’t require any technical installation, and lets teams collaborate on the work for faster
results.

Additionally, a good, cloud-native data preparation tool will offer other benefits (like an intuitive,
simple-to-use GUI) for easier and more efficient preparation.
Data Preparation Steps
The specifics of the data preparation process vary by industry, organization
and need, but the framework remains largely the same.
1. Gather data: The data preparation process begins with finding the right data. This can come
from an existing data catalog or can be added ad hoc.

2. Discover and assess data: After collecting the data, it is important to discover each dataset.
This step is about getting to know the data and understanding what has to be done before the data
becomes useful in a particular context.

3. Cleanse and validate data: Cleaning up the data is traditionally the most time-consuming
part of the data preparation process, but it's crucial for removing faulty data and filling in gaps.
Important tasks here include:

• Removing extraneous data and outliers.
• Filling in missing values.
• Conforming data to a standardized pattern.
• Masking private or sensitive data entries.

Once data has been cleansed, it must be validated by testing for errors in the data preparation process up
to this point. Often, an error in the system will become apparent during this step and will need to be
resolved before moving forward. (A sketch of typical cleansing commands appears after this list.)

4. Transform and enrich data: Transforming data is the process of updating the format or
value entries in order to reach a well-defined outcome, or to make the data more easily understood
by a wider audience. Enriching data refers to adding and connecting data with other related
information to provide deeper insights.

5. Store data: Once prepared, the data can be stored or channeled into a third-party
application—such as a business intelligence tool—clearing the way for processing and analysis to
take place.
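To make the cleansing step concrete, here is a minimal sketch in SPSS syntax (the statistical package used later in this document). The variable names income and city, the outlier cutoff, and the replacement value are all hypothetical:

/*Drop cases with an extreme, implausible value (hypothetical cutoff).*/
SELECT IF (income < 1000000).

/*Fill missing values with a hypothetical replacement value.*/
IF (MISSING(income)) income = 54000.

/*Conform a text field to a standardized pattern (uppercase, trimmed).*/
STRING city_clean (A30).
COMPUTE city_clean = UPCASE(RTRIM(city)).
EXECUTE.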
What is SPSS?

SPSS (Statistical Package for the Social Sciences) is a set of software programs combined into a
single package. The basic application of this program is to analyze scientific data related to the
social sciences. This data can be used for market research, surveys, data mining, etc.
With the help of the obtained statistical information, researchers can easily understand the demand
for a product in the market, and can change their strategy accordingly. Basically, SPSS first stores and
organizes the provided data, then compiles the data set to produce suitable output. SPSS is designed in
such a way that it can handle a wide variety of data formats.

The Core Functions of SPSS

SPSS offers four programs that assist researchers with their complex data analysis needs.

1. Statistics Program: SPSS's Statistics program provides a plethora of basic statistical functions,
some of which include frequencies, cross-tabulation, and bivariate statistics.

2. Modeler Program: SPSS's Modeler program enables researchers to build and validate
predictive models using advanced statistical procedures.

3. Text Analytics for Surveys Program: SPSS's Text Analytics for Surveys program helps
survey administrators uncover powerful insights from responses to open-ended survey questions.

4. Visualization Designer: SPSS's Visualization Designer program allows researchers to use their
survey data to create a wide variety of visuals, like density charts and radial boxplots, with ease.

In addition to the four programs mentioned above, SPSS also provides solutions for data management,
which allow researchers to perform case selection, create derived data, and perform file reshaping.

SPSS also offers data documentation, which allows researchers to store a metadata dictionary. This
metadata dictionary acts as a centralized repository of information pertaining to the data, such as
meaning, relationships to other data, origin, usage, and format.
There are a handful of statistical methods that can be leveraged in SPSS, including:

• Descriptive statistics, including methodologies such as frequencies, cross-tabulation, and descriptive
ratio statistics.
• Bivariate statistics, including methodologies such as analysis of variance (ANOVA), means,
correlation, and nonparametric tests.
• Numeral outcome prediction such as linear regression.
• Prediction for identifying groups, including methodologies such as cluster analysis and factor analysis.
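As a quick illustration, the first two of these can be requested with a few lines of syntax; this is only a sketch, and the variable names age, income, gender, and region are hypothetical:

/*Descriptive statistics for two hypothetical numeric variables.*/
DESCRIPTIVES VARIABLES=age income
  /STATISTICS=MEAN STDDEV MIN MAX.

/*Cross-tabulation of two hypothetical categorical variables.*/
CROSSTABS
  /TABLES=gender BY region.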

Factors that play a role in data handling and its execution:
1. Data Transformation: This technique is used to convert the format of the data. After
changing the data type, it integrates data of the same type in one place, which makes it easy to
manage. You can insert different kinds of data into SPSS, and it will change its
structure as per the system specification and requirement. This means that even if you change
the operating system, SPSS can still work on old data.
2. Regression Analysis: It is used to understand the relationship between dependent and
independent variables stored in a data file. It also explains how a change in the
value of an independent variable can affect the dependent data. The primary purpose of
regression analysis is to understand the type of relationship between different variables.
3. ANOVA (Analysis of Variance): It is a statistical approach to compare events, groups, or
processes and find out the differences between them. It can help you understand which
method is more suitable for executing a task. By looking at the result, you can assess the
feasibility and effectiveness of a particular method.
4. MANOVA (Multivariate Analysis of Variance): This method is used to compare data across
several dependent variables at once. The MANOVA technique can also be used to
analyze different types of populations and the factors that can affect their choices.
5. T-tests: These are used to understand the difference between two sample types, and researchers
apply this method to find out the difference in the interest of two kinds of groups. This test
can also tell whether the produced output is meaningful or not. (A sketch of the corresponding syntax appears after this list.)
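As referenced above, here is a minimal syntax sketch of these procedures; the variable names sales, advertising, score, and group are hypothetical:

/*Simple linear regression with one hypothetical predictor.*/
REGRESSION
  /DEPENDENT sales
  /METHOD=ENTER advertising.

/*One-way ANOVA comparing score across the levels of group.*/
ONEWAY score BY group.

/*Independent-samples t-test for two groups coded 1 and 2.*/
T-TEST GROUPS=group(1 2)
  /VARIABLES=score.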

This software was first developed in 1968, and in 2009 IBM acquired it. IBM has since made
significant changes to the programming of SPSS, and it can now perform many types of research tasks in
various fields. Due to this, the use of this software has extended to many industries and organizations, such
as marketing, health care, education, surveys, etc.
Creating an SPSS Sheet & Entering Data in SPSS
Data Creation in SPSS

When you open the SPSS program, you will see a blank spreadsheet in Data View. If you already have
another dataset open but want to create a new one, click File > New > Data to open a blank
spreadsheet.

You will notice that each of the columns is labeled “var.” The column names will represent the
variables that you enter in your dataset. You will also notice that each row is labeled with a number
(“1,” “2,” and so on). The rows will represent cases that will be a part of your dataset. When you
enter values for your data in the spreadsheet cells, each value will correspond to a specific variable
(column) and a specific case (row).
Follow these steps to enter data:

1. Click the Variable View tab. Type the name for your first variable under the Name column. You
can also enter other information about the variable, such as the type (the default is “numeric”),
width, decimals, label, etc. Type the name for each variable that you plan to include in your
dataset. In this example, I will type “School_Class” since I plan to include a variable for the class
level of each student (i.e., 1 = first year, 2 = second year, 3 = third year, and 4 = fourth year). I
will also specify 0 decimals since my variable values will only include whole numbers. (The
default is two decimals.)

2. Click the Data View tab. Any variable names that you entered in Variable View will now be
included in the columns (one variable name per column). You can see that School_Class appears
in the first column in this example.

3. Now you can enter values for each case. In this example, cases represent students. For each
student, enter a value for their class level in the cell that corresponds to the appropriate row and
column. For example, the first person’s information should appear in the first row, under the
variable column School_Class. In this example, the first person’s class level is “2,” the second
person’s is “1,” the third person’s is “1,” the fourth person’s is “3,” and so on.

4. Repeat these steps for each variable that you will include in your dataset. Don't forget to
periodically save your progress as you enter data. (The same variable definitions can also be made in syntax, as sketched below.)
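For reference, the School_Class variable from this example could also be defined in syntax; this is a sketch, and the label text is illustrative:

/*Label the School_Class variable and its values, with 0 decimals.*/
VARIABLE LABELS School_Class 'Class level of each student'.
VALUE LABELS School_Class 1 'First year' 2 'Second year' 3 'Third year' 4 'Fourth year'.
FORMATS School_Class (F1.0).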

Inserting or Deleting Single Cases

Sometimes you may need to add new cases or delete existing cases from your dataset. For example,
perhaps you notice that one observation in your data was accidentally left out of the dataset. In that
situation, you would refer to the original data collection materials and enter the missing case into the
dataset (as well as the associated values for each variable in the dataset). Alternatively, you may realize
that you have accidentally entered the same case in your dataset more than once and need to remove the
extra case.
INSERTING A CASE

To insert a new case into a dataset:

1. In Data View, click a row number or individual cell below where you want your new row to be
inserted.
2. You can insert a case in several ways:

• Click Edit > Insert Cases;
• Right-click on a row and select Insert Cases from the menu; or
• Click the Insert Cases icon in the toolbar.


3. A new, blank row will appear above the row or cell you selected. Values for each existing
variable in your dataset will be missing (indicated by either a “.” or a blank cell) for your newly
created case since you have not yet entered this information.

4. Type in the values for each variable in the new case.


DELETING A CASE

To delete an existing case from a dataset:

1. In the Data View tab, click the case number (row) that you wish to delete. This will
highlight the row for the case you selected.
2. Press Delete on your keyboard, or right-click on the case number and select
“Clear”. This will remove the entire row from the dataset.
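Cases can also be dropped in syntax. There is no direct row-delete command; a common approach is to filter on an ID variable. A sketch, assuming the ids variable from the sample dataset and a hypothetical case number 105:

/*Keep every case except the one whose ids value is 105.*/
SELECT IF NOT (ids = 105).
EXECUTE.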

Inserting or Deleting Single Variables

Sometimes you may need to add new variables or delete existing variables from your dataset. For
example, perhaps you are in the process of creating a new dataset and you must add many new variables
to your growing dataset. Alternatively, perhaps you decide that some variables are not very useful to your
study and you decide to delete them from the dataset. Or, similarly, perhaps you are creating a smaller
dataset from a very large dataset in order to make the dataset more manageable for a research project that
will only use a subset of the existing variables in the larger dataset.

INSERTING A VARIABLE

To insert a new variable into a dataset:

1. In the Data View window, click the name of the column to the right of where you want
your new variable to be inserted.
2. You can now insert a variable in several ways:

• Click Edit > Insert Variable;
• Right-click an existing variable name and click Insert Variable; or
• Click the Insert Variable icon in the toolbar.

A new, blank column will appear to the left of the column or cell you selected.
New variables will be given a generic name (e.g. VAR00001). You can enter a new name for the
variable on the Variable View tab. You can quick-jump to the Variable View screen by
double-clicking on the generic variable name at the top of the column. Once in the Variable View, under
the column “Name,” type a new name for the variable you wish to change. You should also
define the variable's other properties (type, label, values, etc.) at this time.

All values for the newly created variable will be missing (indicated by a “.” in each cell in Data
View, by default) since you have not yet entered any values. You can enter values for the new
variable by clicking the cells in the column and typing the values associated with each case
(row).
Tip:
Is it possible to insert a variable using syntax? Technically, there's no direct syntax command to do so.
Instead, you'll need to use two syntax commands. You'll first use the COMPUTE command to initialize the
new variable. You'll then use the MATCH FILES command to actually re-order the variables. Suppose we
want to insert a new column of blank values into the sample dataset after the first variable, ids. We can
use this syntax to perform these tasks:

/*Compute new variable containing blanks (system-missing values).*/
COMPUTE newvar=$SYSMIS.
EXECUTE.

/*Reorder the variables to place the new variable in the desired position.*/
MATCH FILES
FILE = *
/KEEP = ids newvar ALL.

In the MATCH FILES command, FILE=* says to act on the current active dataset.
The /KEEP statement tells SPSS the specific order of the variables you want: we list the variables by
name, in the order we want, separated by spaces, on the right side of the equals sign. The ALL option at
the end of the line says to retain all remaining variables in their current order. The ALL option can only
be used at the end of the line; the code will fail if you try to put it before other variable names. If we do
not include ALL, SPSS will throw out any variables not named in the /KEEP statement.

DELETING A VARIABLE
To delete an existing variable from a dataset:

1. In the Data View tab, click the column name (variable) that you wish to delete. This will
highlight the variable column.
2. Press Delete on your keyboard, or right-click on the selected variable and click “Clear.” The
variable and associated values will be removed.

Alternatively, you can delete a variable through the Variable View window:

1. Click on the row number corresponding to the variable you wish to delete. This will highlight the
row.
2. Press Delete on your keyboard, or right-click on the row number corresponding to the variable
you wish to delete and click "Clear".

You can also delete variables using command syntax.

/*Delete one variable.*/
DELETE VARIABLES var1.

/*Delete several variables.*/
DELETE VARIABLES var1 var2 var3.

ID Variables versus Row Numbers


Now that you know how to enter data, it is important to discuss a special type of variable called an ID
variable. When data are collected, each piece of information is tied to a particular case. For example,
perhaps you distributed a survey as part of your data collection, and each survey was labeled with a
number (“1,” “2,” etc.). In this example, the survey numbers essentially represent ID numbers: numbers
that help you identify which pieces of information go with which respondents in your sample. Without
these ID numbers, you would have no way of tracking which information goes with which respondent,
and it would be impossible to enter the data accurately into SPSS.

When you enter data into SPSS, you will need to make sure that you are entering values for each variable
that correspond to the correct person or object in your sample. It might seem like a simple solution to use
the conveniently labeled rows in SPSS as ID numbers; you could enter your first respondent’s
information in the row that is already labeled “1,” the second respondent’s information in the row labeled
“2,” etc. However, you should never rely on these pre-numbered rows for keeping track of the specific
respondents in your sample. This is because the numbers for each row are visual guides only—they are
not attached to specific lines of data, and thus cannot be used to identify specific cases in your data. If
your data become rearranged (e.g., after sorting data), the row numbers will no longer be associated with
the same case as when you first entered the data. Again, the row numbers in SPSS are not attached to
specific lines of data and should not be used to identify certain cases. Instead, you should create a
variable in your dataset that is used to identify each case—for example, a variable called StudentID.

Here is an example that illustrates why using the row numbers in SPSS as case identifiers is flawed:

Let’s say that you have entered values for each person for the School_Class variable. You relied on the
row numbers in SPSS to correspond to your survey ID numbers. Thus, for survey #1, you entered the first
respondent’s information in row 1, for survey #2 you entered the second person’s information in row 2,
and so on. Now you have entered all of your data.

But suppose the data get rearranged in the spreadsheet view. A common way of rearranging data is by
sorting—and you may very well need to do this as you explore and analyze your data. Sorting will
rearrange the rows of data so that the values appear in ascending or descending order. If you right-click
on any variable name, you can select “Sort Ascending” or “Sort Descending.” In the example below, the
data are sorted in ascending order on the values for the variable School_Class.
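Sorting can equivalently be done in syntax, for example:

/*Sort the dataset in ascending order of School_Class.*/
SORT CASES BY School_Class (A).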
But what happens if you need to view a specific respondent’s information? Or perhaps you need to
double-check your entry of the data by comparing the original survey to the values you entered in SPSS.
Now that the data have been rearranged, there is no way to identify which row corresponds to which
participant/survey number.

The main point is that you should not rely on the row numbers in SPSS since they are merely visual
guides and not part of your data. Instead, you should create a specific variable that will serve as an ID for
each case so that you can always identify certain cases in your data, no matter how much you rearrange
the data. In the sample data file, the variable ids acts as the ID variable.

Tip
If you do not have an ID variable in your dataset, a convenient way to generate one is to use the system
variable $CASENUM. You can use the Compute Variables procedure (simply enter $CASENUM in the
Numeric Expression box), or run the following syntax after all of your data has been entered:

COMPUTE id=$CASENUM.
EXECUTE.
Understanding Descriptive Statistics
Statistics is a branch of mathematics that deals with collecting, organizing, and interpreting data.

Initially, when we get the data, instead of applying fancy algorithms and making predictions, we first try to read and understand it by applying statistical techniques. By doing this, we are able to understand what type of distribution the data has.

This blog aims to answer the following questions:

1. What is descriptive statistics?
2. What are the types of descriptive statistics?
3. Measures of central tendency (mean, median, mode)
4. Measures of spread/dispersion (standard deviation, mean deviation, variance, percentile, quartiles, interquartile range)
5. What is skewness?
6. What is kurtosis?
7. What is correlation?

Today, let's understand descriptive statistics once and for all. Let's start.

What is Descriptive Statistics?

Descriptive statistics involves summarizing and organizing the data so they can be easily understood. Descriptive statistics, unlike inferential statistics, seeks to describe the data, but does not attempt to make inferences from the sample to the whole population. Here, we typically describe the data in a sample. This generally means that descriptive statistics, unlike inferential statistics, is not developed on the basis of probability theory.

Types of Descriptive Statistics

Descriptive statistics are broken down into two categories: measures of central tendency and measures of variability (spread).

Measures of Central Tendency

Central tendency refers to the idea that there is one number that best summarizes the entire set of measurements, a number that is in some way “central” to the set.

Mean / Average

Mean or average is a measure of central tendency of the data, i.e. a number around which the whole data is spread out. In a way, it is a single number that can estimate the value of the whole data set.

Let's calculate the mean of a data set having 8 integers: 12, 24, 41, 51, 67, 67, 85, 99 (this set is used throughout this section, and it reproduces every statistic quoted below):

$$\bar{x} = \frac{12 + 24 + 41 + 51 + 67 + 67 + 85 + 99}{8} = \frac{446}{8} = 55.75$$

Median

Median is the value that divides the data into 2 equal parts, i.e. the number of terms on its right side is the same as the number of terms on its left side when the data is arranged in either ascending or descending order.

Note: If you sort data in descending order, it won't affect the median, but the IQR will be negative. We will talk about IQR later in this blog.

The median will be the middle term if the number of terms is odd.

The median will be the average of the middle 2 terms if the number of terms is even.

For the data set 12, 24, 41, 51, 67, 67, 85, 99, the median is 59, which divides the set of numbers into two equal parts. Since there is an even number of values in the set, the answer is the average of the middle numbers 51 and 67.

Note: When values are in arithmetic progression (the difference between consecutive terms is constant; here it is 2), the median is always equal to the mean. For the data set 2, 4, 6, 8, 10, the mean of these 5 numbers is 6, and so is the median.

Mode

Mode is the term appearing the maximum number of times in the data set, i.e. the term with the highest frequency.

In the data set 12, 24, 41, 51, 67, 67, 85, 99, the mode is 67 because it appears more times than the rest of the values, i.e. twice.

There could also be a data set with no mode at all, where all values appear the same number of times. If two values appear the same number of times and more than the rest of the values, the data set is bimodal. If three values appear the same number of times and more than the rest of the values, the data set is trimodal, and for n modes, the data set is multimodal.
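In SPSS, all three measures of central tendency can be requested at once. A minimal sketch, assuming a hypothetical numeric variable score:

/*Mean, median, and mode for a hypothetical variable.*/
FREQUENCIES VARIABLES=score
  /STATISTICS=MEAN MEDIAN MODE
  /FORMAT=NOTABLE.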

Measure of Spread / Dispersion

Measure of Spread refers to the idea of variability within your data.

Standard deviation

Standard deviation is the measurement of the average distance between each quantity and the mean; that is, how data is spread out from the mean. A low standard deviation indicates that the data points tend to be close to the mean of the data set, while a high standard deviation indicates that the data points are spread out over a wider range of values.

There are situations when we have to choose between the sample and the population standard deviation. When we are asked to find the SD of some part of a population, a segment of the population, we use the sample standard deviation:

$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$

where x̅ is the mean of the sample.


But when we have to deal with a whole population, we use the population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$$

where µ is the mean of the population.

Though a sample is a part of a population, you might expect their SD formulas to be the same, but they are not. To find out more about it, refer to this link.

As you know, in descriptive statistics we generally deal with the data available in a sample, not in a population. So if we use the previous data set and substitute the values in the sample formula:

$$s = \sqrt{\frac{(12 - 55.75)^2 + (24 - 55.75)^2 + \dots + (99 - 55.75)^2}{8 - 1}}$$

And the answer would be 29.62.

Mean Deviation / Mean Absolute Deviation

It is the average of the absolute differences between each value in a set of values and the average of all values of that set. So if we use the previous data set and substitute the values:

$$MAD = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - \bar{x}\rvert = \frac{\lvert 12 - 55.75\rvert + \lvert 24 - 55.75\rvert + \dots + \lvert 99 - 55.75\rvert}{8}$$

And the answer would be 23.75.

Variance

Variance is the square of the average distance between each quantity and the mean; that is, it is the square of the standard deviation:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$

And the answer would be 877.34.

Range

Range is one of the simplest techniques of descriptive statistics. It is the difference between the lowest and the highest value. For the data set above, the range is 99 − 12 = 87.

Percentile

Percentile is a way to represent the position of a value in a data set. To calculate a percentile, the values in the data set should always be in ascending order.

In the sorted data set 12, 24, 41, 51, 67, 67, 85, 99, the median 59 has 4 values less than itself out of 8. It can also be said that, in this data set, 59 is the 50th percentile, because 50% of the total terms are less than 59. In general, if k is the nth percentile, it implies that n% of the total terms are less than k.

Quartiles

In statistics and probability, quartiles are values that divide your data into quarters, provided the data is sorted in ascending order.

Quartiles [Image courtesy: https://statsmethods.wordpress.com/2013/05/09/iqr/]

There are three quartile values. The first quartile value is at the 25th percentile, the second quartile is at the 50th percentile, and the third quartile is at the 75th percentile. The second quartile (Q2) is the median of the whole data. The first quartile (Q1) is the median of the lower half of the data, and the third quartile (Q3) is the median of the upper half of the data.

So here, by analogy,

Q2 = 67: is the 50th percentile of the whole data.

Q1 = 41: is the 25th percentile of the data.

Q3 = 85: is the 75th percentile of the data.

Interquartile range (IQR) = Q3 − Q1 = 85 − 41 = 44

Note: If you sort the data in descending order, the IQR will be −44. The magnitude will be the same; just the sign will differ. A negative IQR is fine if your data is in descending order; it just means we are subtracting larger values from smaller ones. We prefer ascending order (Q3 − Q1).
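In SPSS, quartiles (and any other percentiles) can be obtained with a short sketch like this, again assuming a hypothetical variable score:

/*25th, 50th, and 75th percentiles for a hypothetical variable.*/
FREQUENCIES VARIABLES=score
  /PERCENTILES=25 50 75
  /FORMAT=NOTABLE.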

Skewness

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, negative, or undefined.

In a perfect normal distribution, the tails on either side of the curve are exact mirror images of each other.

When a distribution is skewed to the left, the tail on the curve's left-hand side is longer than the tail on the right-hand side, and the mean is less than the mode. This situation is also called negative skewness.

When a distribution is skewed to the right, the tail on the curve's right-hand side is longer than the tail on the left-hand side, and the mean is greater than the mode. This situation is also called positive skewness.

Skewness [Image courtesy: https://www.safaribooksonline.com/library/view/clojure-for-data/9781784397180/ch01s13.html]
How to calculate the skewness coefficient?

To calculate the skewness coefficient of a sample, there are two methods:

1] Pearson First Coefficient of Skewness (Mode skewness):

$$Sk_1 = \frac{\bar{x} - \text{Mode}}{s}$$

2] Pearson Second Coefficient of Skewness (Median skewness):

$$Sk_2 = \frac{3(\bar{x} - \text{Median})}{s}$$

Interpretations

• The direction of skewness is given by the sign; a value of zero means no skewness at all.
• A negative value means the distribution is negatively skewed. A positive value means the distribution is positively skewed.
• The coefficient compares the sample distribution with a normal distribution. The larger the value, the more the distribution differs from a normal distribution.

Sample problem: Use Pearson’s Coefficient #1 and #2 to find the skewness for data with the following characteristics:

• Mean = 50.
• Median = 56.
• Mode = 60.
• Standard deviation = 8.5.

Pearson’s First Coefficient of Skewness: (50 − 60) / 8.5 = −1.17.

Pearson’s Second Coefficient of Skewness: 3 × (50 − 56) / 8.5 = −2.117.

Note: Pearson’s first coefficient of skewness uses the mode. Therefore, if the mode occurs with very low frequency, it will not give a stable measure of central tendency. For example, the mode in both of these data sets is 4:

1, 2, 3, 4, 4, 5, 6, 7, 8, 9.

In the first set of data, the mode appears only twice, so it is not a good idea to use Pearson’s First Coefficient of Skewness. But in the second set,

1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 10, 12, 12, 13.

the mode 4 appears 8 times. Therefore, Pearson’s First Coefficient of Skewness will likely give you a reasonable result.

Kurtosis

The exact interpretation of the measure of kurtosis used to be disputed but is now settled: it is about the existence of outliers. Kurtosis is a measure of whether the data are heavy-tailed (profusion of outliers) or light-tailed (lack of outliers) relative to a normal distribution.

There are three types of kurtosis:

Mesokurtic

A mesokurtic distribution has kurtosis similar to that of the normal distribution, i.e. an excess kurtosis of zero.

Leptokurtic

A leptokurtic distribution has kurtosis greater than a mesokurtic distribution. The tails of such distributions are thick and heavy. If the curve of a distribution is more peaked than the mesokurtic curve, it is referred to as a leptokurtic curve.

Platykurtic

A platykurtic distribution has kurtosis less than a mesokurtic distribution. The tails of such distributions are thinner. If the curve of a distribution is less peaked than a mesokurtic curve, it is referred to as a platykurtic curve.

The main difference between skewness and kurtosis is that skewness refers to the degree of symmetry, whereas kurtosis refers to the degree of presence of outliers in the distribution.
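Both shape measures are available in SPSS. A minimal sketch, once more with a hypothetical variable score:

/*Skewness and kurtosis alongside basic descriptives.*/
DESCRIPTIVES VARIABLES=score
  /STATISTICS=MEAN STDDEV SKEWNESS KURTOSIS.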

Correlation

Correlation is a statistical technique that can show whether, and how strongly, pairs of variables are related. The main result of a correlation is called the correlation coefficient (or “r”). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are related.

If r is close to 0, it means there is no relationship between the variables. If r is positive, it means that as one variable gets larger, the other gets larger. If r is negative, it means that as one gets larger, the other gets smaller (often called an “inverse” correlation).
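In SPSS, Pearson's r for a pair of variables can be requested as follows; the variable names height and weight are hypothetical:

/*Pearson correlation between two hypothetical variables.*/
CORRELATIONS
  /VARIABLES=height weight
  /PRINT=TWOTAIL NOSIG.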
