UNIT IV Data Preparation And Analysis

Data Preparation: includes editing, coding, and data entry and is
the activity that ensures the accuracy of the data and their
conversion from raw form to reduced and classified forms that
are more appropriate for analysis. Preparing a descriptive
statistical summary is another preliminary step leading to an
understanding of the collected data.
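As an illustration of the descriptive statistical summary mentioned above, here is a minimal sketch using only the Python standard library. The variable (respondent ages) and the helper name describe are hypothetical and chosen only for the example.

```python
import statistics

def describe(values):
    """Return a small descriptive summary of one numeric variable."""
    ordered = sorted(values)
    return {
        "n": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
    }

# Hypothetical ages recorded for eight respondents.
ages = [23, 35, 31, 42, 28, 35, 51, 27]
summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A summary like this is usually inspected before any further analysis, since an implausible minimum or maximum often points to an editing or data entry error.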
Editing, Coding, Data Entry: Editing detects errors and
omissions, corrects them when possible, and certifies that
maximum data quality standards are achieved. The two types of editing
are field editing and central editing. Coding involves assigning
numbers or other symbols to answers so that the responses can
be grouped into a limited number of categories. In coding,
categories are the partitions of a data set of a given variable
(e.g., if the variable is gender, the partitions are male and
female). Categorization is the process of using rules to partition
a body of data. Both closed- and open-response questions must
be coded. A codebook, or coding scheme, contains each variable
in the study and specifies the application of coding rules to the
variable. It is used by the researcher or research staff to promote
more accurate and more efficient data entry or data analysis. It is
also the definitive source for locating the positions of variables in
the data file during analysis. Coding rules - Four rules guide the
precoding and postcoding and categorization of a data set. The
categories within a single variable should be: Appropriate to the
research problem and purpose. Exhaustive. Mutually exclusive.
Derived from one classification dimension. Content analysis
follows a systematic process for coding and drawing inferences
from texts. It starts by determining which units of data will be
analyzed. Content Analysis Types: 1) Syntactical units can
be words, phrases, sentences, or paragraphs; words are the
smallest and most reliable data units to analyze; 2) Referential
units are described by words, phrases, and sentences; they may
be objects, events, persons, and so forth, to which a verbal or
textual expression refers; 3) Propositional units are assertions
about an object, event, person, and so on; 4) Thematic units are
topics contained within (and across) texts; they represent higher-level abstractions inferred from the text and its context. Missing
data are information from a participant or case that is not
available for one or more variables of interest. In survey studies,
missing data typically occur when participants accidentally skip,
refuse to answer, or do not know the answer to an item on the
questionnaire. Data entry converts information gathered by
secondary or primary methods to a medium for viewing and
manipulation. Keyboarding remains a mainstay for researchers
who need to create a data file immediately and store it in a
minimal space on a variety of media.
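The coding step and the treatment of missing data described above can be sketched as follows. This is only an illustration: the codebook contents, the variable owns_car, and the missing-value code 9 are assumptions made for the example, not values from any particular study.

```python
# Hypothetical codebook: each variable maps an answer to a numeric code.
CODEBOOK = {
    "gender": {"male": 1, "female": 2},
    "owns_car": {"yes": 1, "no": 0},
}
MISSING = 9  # assumed code for skipped, refused, or "don't know" answers

def code_response(variable, answer):
    """Translate one raw answer into its numeric code."""
    if answer is None or answer.strip() == "":
        return MISSING
    categories = CODEBOOK[variable]
    key = answer.strip().lower()
    if key not in categories:
        # Categories should be exhaustive, so an unknown answer needs editing.
        raise ValueError(f"{answer!r} is not a category of {variable!r}")
    return categories[key]

# Hypothetical raw questionnaire records, before coding.
raw_records = [
    {"gender": "Male", "owns_car": "yes"},
    {"gender": "female", "owns_car": ""},
    {"gender": None, "owns_car": "No"},
]

coded = [
    {var: code_response(var, rec.get(var)) for var in CODEBOOK}
    for rec in raw_records
]
missing_counts = {
    var: sum(1 for row in coded if row[var] == MISSING) for var in CODEBOOK
}
print(coded)
print(missing_counts)
```

Keeping the codes in one dictionary mirrors the role of the codebook: the same rules are applied to every record, and the count of missing codes per variable shows where data are incomplete.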
Validity of data: In general, validity is an indication of how sound
your research is. More specifically, validity applies to both the
design and the methods of your research. Validity in data
collection means that your findings truly represent the
phenomenon you are claiming to measure. Valid claims are solid
Qualitative Vs Quantitative data analyses: Read Exhibit 7-2.
Bivariate and Multivariate statistical techniques: Bivariate
studies differ from univariate studies because they allow
the researcher to analyze the relationship between two variables
(often denoted as X, Y) in order to test simple hypotheses of
association and causality. For example, if you wanted to know
whether there is a relationship between the number of students
in an engineering classroom (independent variable) and their
grades in that subject (dependent variable), you would use
bivariate analysis since it measures two elements based on the
observation of data. Four steps to conducting bivariate analysis:
1) Define the nature of the relationship; 2) Identify the type and
direction of the relationship; 3) Determine if the relationship is
statistically significant; 4) Identify the strength of the relationship.
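The four steps above can be sketched for the class-size example with a Pearson correlation. The data are invented for illustration, the 0.7 cut-off for a "strong" relationship is a rule-of-thumb assumption, and 2.776 is the two-tailed 5% critical value of t for 4 degrees of freedom from a standard t table.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: no linear association (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Step 1: hypothetical data, class size (X) and average grade (Y) for six classes.
class_size = [15, 20, 25, 30, 35, 40]
avg_grade = [82, 80, 77, 75, 70, 68]

r = pearson_r(class_size, avg_grade)
# Step 2: type (linear) and direction of the relationship.
direction = "negative" if r < 0 else "positive"
# Step 3: significance, compared with the critical t for 4 degrees of freedom.
t = t_statistic(r, len(class_size))
significant = abs(t) > 2.776
# Step 4: strength, using an assumed rule-of-thumb threshold.
strength = "strong" if abs(r) >= 0.7 else "weak to moderate"

print(f"r = {r:.3f} ({direction}, {strength}), t = {t:.2f}, significant: {significant}")
```

A significant correlation supports association only; it does not by itself establish that class size causes the change in grades.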
Multivariate studies are similar to bivariate studies, but
they involve more than two variables at once. For
example, if an advertiser wanted to examine the effectiveness of
three different banner ads on a popular website, the advertiser
could measure each ad's click rate for both men and women.
Researchers could then use multivariate statistical analysis to
examine the relationships between all of the variables.
Multivariate analytical techniques represent a variety of
mathematical models used to measure and quantify outcomes,
taking into account important factors that can influence this
relationship. The most popular is multiple regression analysis
which helps one understand how the typical value of the
dependent variable changes when any one of the independent
variables is varied, while the other independent variables are
held fixed. Other techniques include factor analysis, path analysis
and multivariate analysis of variance (MANOVA).
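To make the idea of multiple regression concrete, the sketch below fits an ordinary least-squares model with two independent variables by solving the normal equations in plain Python. The data set (sales explained by ad spend and staff count) and all function names are hypothetical; the outcome is generated from known coefficients so that the fit can be checked.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: (ad spend, number of sales staff) for six stores.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5), (6, 2)]
# Outcome built from known coefficients so the result can be verified.
sales = [1 + 2 * ad + 3 * staff for ad, staff in predictors]

intercept, b_ad, b_staff = fit_ols(predictors, sales)
print(f"sales = {intercept:.2f} + {b_ad:.2f} * ad + {b_staff:.2f} * staff")
```

Each slope is read as the change in the dependent variable for a one-unit change in that independent variable while the other is held fixed.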
Factor analysis: It is a statistical tool that measures the impact of
a few un-observed variables called factors on a large number of
observed variables. It is used as a data reduction method. It may
be used to uncover and establish the cause-and-effect
relationship between variables or to confirm a hypothesis. It is
often used to determine a linear relationship between variables
before subjecting them to further analysis. Principal Factor
Analysis is also called Common Factor Analysis and it aims to
identify the minimum number of factors that can lead to the
correlation between a given set of variables. Other types of Factor
Analysis include Image factoring, Alpha factoring, Principal
Component Analysis and so on.
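The data reduction idea can be illustrated with a small sketch that extracts a single component from the correlation matrix of three observed variables. Note the assumptions: this uses principal component extraction (one of the factoring types listed above) computed by power iteration, not common factor analysis, and the test scores are invented.

```python
import math

def correlation_matrix(columns):
    """Correlation matrix of several equally long columns of numbers."""
    def standardize(col):
        n = len(col)
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        return [(v - mean) / sd for v in col]
    z = [standardize(c) for c in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def first_component(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(
        v * sum(m * w for m, w in zip(row, vec)) for v, row in zip(vec, matrix)
    )
    return eigenvalue, vec

# Hypothetical scores of six students on three observed variables.
maths = [60, 65, 70, 75, 80, 90]
physics = [58, 68, 69, 78, 79, 92]
chemistry = [62, 63, 72, 74, 83, 88]

corr = correlation_matrix([maths, physics, chemistry])
eigenvalue, vector = first_component(corr)
loadings = [v * math.sqrt(eigenvalue) for v in vector]
share = eigenvalue / len(corr)  # proportion of total variance explained

print("loadings:", [round(l, 3) for l in loadings])
print("variance explained:", round(share, 3))
```

When one component carries most of the variance, the three observed variables can reasonably be summarized by a single underlying dimension, here something like general science ability.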
Discriminant analysis: It is a statistical tool with an objective to
assess the adequacy of a classification, given the group
memberships; or to assign objects to one group among a number
of groups. For any kind of Discriminant Analysis, some group
assignments should be known beforehand. Discriminant Analysis
is quite close to being a graphical version of MANOVA and is often
used to complement the findings of Cluster Analysis and Principal
Components Analysis. When Discriminant Analysis is used to
separate two groups, it is called Discriminant Function Analysis
(DFA); when there are more than two groups, the
Canonical Variates Analysis (CVA) method is used. Discriminant
Analysis has various benefits as a statistical tool and is quite
similar to regression analysis. It can be used to determine which
predictor variables are related to the dependent variable and to
predict the value of the dependent variable given certain values
of the predictor variables. Discriminant Analysis is also
widely used to create Perceptual Mapping by marketers and has
some benefits over other methods that use perceived distances;
like the option of using tests of significance to check for
dissimilarities among products and that the distances between
two products would not be impacted by other products included
in the study. Discriminant Analysis is often used in combination
with cluster analysis. For example, the loans department of a bank
may want to assess the creditworthiness of applicants before disbursing
loans. It may use Discriminant Analysis to determine whether an
applicant is a good credit risk or not.
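The bank example can be sketched as a two-group linear discriminant (Fisher's method) on two predictors. Everything specific here is an assumption for illustration: the predictors (income in thousands and debt ratio), the eight past applicants with known group membership, and the function names.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(2)]

def scatter(rows, mean):
    """2x2 within-group scatter matrix."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fit_discriminant(group_a, group_b):
    """Return weights w and a cutoff; a case belongs to group A when w . x > cutoff."""
    ma, mb = mean_vector(group_a), mean_vector(group_b)
    sa, sb = scatter(group_a, ma), scatter(group_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    score_a = w[0] * ma[0] + w[1] * ma[1]
    score_b = w[0] * mb[0] + w[1] * mb[1]
    return w, (score_a + score_b) / 2  # cutoff halfway between the group means

# Hypothetical past applicants with known outcomes: (income, debt ratio).
good_risks = [(60, 0.20), (75, 0.25), (80, 0.15), (90, 0.30)]
bad_risks = [(30, 0.60), (40, 0.55), (35, 0.70), (50, 0.50)]

weights, cutoff = fit_discriminant(good_risks, bad_risks)

def classify(applicant):
    score = weights[0] * applicant[0] + weights[1] * applicant[1]
    return "good" if score > cutoff else "bad"

print(classify((70, 0.25)))
print(classify((38, 0.65)))
```

The known group memberships are what make this discriminant analysis rather than cluster analysis: the rule is learned from cases whose group is already established and then applied to new applicants.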
Cluster analysis: It is a statistical tool used to classify objects into
groups, such that the objects belonging to one group are much
more similar to each other and rather different from objects
belonging to other groups. It is generally used for exploratory
data analysis and serves as a method of discovery by solving
classification issues. 1) Hierarchical cluster analysis methods:
in agglomerative methods, all objects start in separate
clusters and the most similar objects are combined step by step
until all objects are in a single cluster. Finally, the
optimum number of clusters is chosen from among all options.
In divisive methods, all objects start in the same cluster
and the reverse of the agglomerative procedure is applied. 2) Non-hierarchical
cluster analysis methods (k-means clustering is the best-known
example): These are generally used when large data
sets are involved. Further, these provide the flexibility of moving
a subject from one cluster to another. The main benefit of Cluster
Analysis is that it allows us to group similar data together. This
helps us identify patterns between data elements. It reveals
associations between data objects and helps to outline structure
which might not have been apparent previously but gives much
sense and meaning to the data when discovered. Once a clear
structure emerges, it allows easier decision making.
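A minimal k-means sketch shows how non-hierarchical clustering moves objects between clusters until the grouping is stable. The six two-dimensional points and the choice of starting centroids are assumptions for illustration; real applications usually try several starting points.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return labels, centroids

# Hypothetical objects described by two measurements each.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
# Assumed starting centroids: one point taken from each visible group.
labels, centroids = k_means(points, [points[0], points[3]])

print(labels)
print(centroids)
```

The number of clusters has to be chosen in advance here, which is the main practical difference from the hierarchical methods, where it is chosen after inspecting the full merge sequence.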
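Multiple regression and correlation: Multiple regression models a
continuous dependent variable as a function of two or more
independent variables, and correlation measures the strength and
direction of the linear association between variables. Logistic
regression is a related technique for categorical outcomes: it aims to
measure the relationship between a categorical dependent
variable and one or more independent variables (usually
continuous) by estimating the dependent variable's probability
scores. A categorical variable is a variable that can take values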
falling in limited categories instead of being continuous. Logistic
regression uses regression to predict the outcome of a categorical
dependent variable on the basis of predictor variables. The
probable outcomes of a single trial are modeled as a function of
the explanatory variable using a logistic function. Logistic
modeling is done on categorical data which may be of various
types including binary and nominal. For example, a variable
might be binary and have two possible categories of yes and
no; or it may be nominal, say hair color, which may be black, brown, red,
gold or grey. Another objective of logistic regression is to check
if the probability of getting a particular value of the dependent
variable is related to the independent variable. Multiple logistic
regression is used when there is more than one independent
variable under study. For example, logistic regression would help
identify factors like product quality, service quality, brand image,
reward programs, etc., that impact customers' loyalty and
willingness to recommend a retail store's products to others. The
results would help improve the store's performance on these
parameters and increase customer loyalty.
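The logistic function mentioned above can be illustrated with a binary outcome and one predictor. This is a sketch under stated assumptions: the data (a service quality rating from 1 to 10 and whether the customer recommends the store) are invented, and the model is fitted by simple batch gradient descent rather than by the routines a statistical package would use.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, learning_rate=0.1, epochs=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        errors = [sigmoid(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        b0 -= learning_rate * sum(errors) / n
        b1 -= learning_rate * sum(e * xi for e, xi in zip(errors, x)) / n
    return b0, b1

# Hypothetical survey: service quality rating and recommendation (1 = yes, 0 = no).
rating = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
recommends = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(rating, recommends)

def probability(r):
    return sigmoid(b0 + b1 * r)

print(f"P(recommend | rating 2) = {probability(2):.2f}")
print(f"P(recommend | rating 9) = {probability(9):.2f}")
```

A positive slope b1 indicates that higher ratings go with a higher probability of recommending, which is the kind of relationship the retail example is looking for.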
Multidimensional scaling: Multidimensional scaling (MDS) is a means of
visualizing the level of
similarity of individual cases of a dataset. It refers to a set of
related ordination techniques used in information visualization, in
particular to display the information contained in a distance
matrix. Steps: 1) Formulate the problem; 2) Obtain input
data; 3) Run the MDS statistical program; 4) Decide the number
of dimensions; 5) Map the results and define the
dimensions; 6) Test the results for reliability and validity; 7)
Report the results comprehensively. For example, in marketing, MDS
is a statistical technique for taking the preferences and
perceptions of respondents and representing them on a visual
grid, called a perceptual map. By mapping multiple attributes and
multiple brands at the same time, a greater understanding of the
marketplace and of consumers' perceptions can be achieved, as
compared with a basic two-attribute perceptual map.
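How a distance matrix becomes a two-dimensional map can be sketched with classical (metric) MDS. The assumptions are worth stating: the dissimilarities between four hypothetical brands are invented and behave like ordinary Euclidean distances, and the eigenvectors are found by simple power iteration instead of a library routine.

```python
import math

def top_eigenpair(matrix, iterations=200):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    vec = [2.0 ** i for i in range(len(matrix))]  # arbitrary non-constant start
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    value = sum(v * sum(m * w for m, w in zip(row, vec)) for v, row in zip(vec, matrix))
    return value, vec

def classical_mds(dist, dims=2):
    """Coordinates whose pairwise distances approximate the given distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # Double centering turns squared distances into inner products.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        value, vec = top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i].append(vec[i] * scale)
        # Deflate: remove the dimension just extracted before finding the next.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return coords

# Hypothetical dissimilarities between four brands A, B, C and D.
brands = ["A", "B", "C", "D"]
dist = [
    [0, 3, 5, 4],
    [3, 0, 4, 5],
    [5, 4, 0, 3],
    [4, 5, 3, 0],
]
coords = classical_mds(dist)
for name, (x, y) in zip(brands, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

The resulting axes have no built-in meaning; naming them is the "defining the dimensions" step, and the map can be rotated or mirrored without changing the distances between brands.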
Application of statistical software for data analysis: The following
statistical software packages and their features support data
analysis: 1) SAS/STAT: SAS/STAT software is designed for both
specialized and enterprise-wide analytical needs. It relies more
on coding and less on menu-driven interaction, and it provides a
comprehensive set of tools that can meet the data analysis needs
of the entire organization. Features: ANOVA; Mixed models
(linear mixed, non-linear mixed and general linear models);
Regression; Categorical data analysis; Bayesian analysis;
Multivariate analysis; Survival analysis; Psychometric analysis;
Cluster analysis; Nonparametric analysis; Survey data analysis;
Multiple imputation for missing values. 2) SPSS: It is more
menu-driven and requires less coding. Typical uses: analysing
variables separately; comparing multiple variables; measuring
association between variables. 3)
R: It is entirely code-driven and gives access to most current
methods of data analysis. Features: a very wide range of data
analysis methods;
creating unique and beautiful data visualizations; getting
results faster; drawing on the talents of statisticians worldwide, who
publish method libraries for free use.