
PREVENTIVE VETERINARY MEDICINE

Preventive Veterinary Medicine 29 (1996) 221-239

An overview of techniques for dealing with large numbers of independent variables in epidemiologic studies
I.R. Dohoo a,*, C. Ducrot b, C. Fourichon c, A. Donald a, D. Hurnik a

a Department of Health Management, Atlantic Veterinary College, University of P.E.I., Charlottetown, P.E.I. C1A 4P3, Canada
b Centre d'Ecopathologie Animale, 26 rue de la Baïse, 69100 Villeurbanne, France
c INRA-Veterinary School, Unit of Animal Health Management, CP 3013, 44087 Nantes cedex 03, France

* Corresponding author.

Accepted 13 May 1996

Abstract

Many studies of health and production problems in livestock involve the simultaneous
evaluation of large numbers of risk factors. These analyses may be complicated by a number of
problems including: multicollinearity (which arises because many of the risk factors may be
related (correlated) to each other), confounding, interaction, problems related to sample size (and
hence the power of the study), and the fact that many associations are evaluated from a single
dataset. This paper focuses primarily on the problem of multicollinearity and discusses a number
of techniques for dealing with this problem. However, some of the techniques discussed may also
help to deal with the other problems identified above.
The first general approach to dealing with multicollinearity involves reducing the number of
independent variables prior to investigating associations with the disease. Techniques to accom-
plish this include: (1) excluding variables after screening for associations among independent
variables; (2) creating indices or scores which combine data from multiple factors into a single
variable; (3) creating a smaller set of independent variables through the use of multivariable
techniques such as principal components analysis or factor analysis.
The second general approach is to use appropriate steps and statistical techniques to investigate
associations between the independent variables and the dependent variable. A preliminary
screening of these associations may be performed using simple statistical tests. Subsequently,
multivariable techniques such as linear or logistic regression or correspondence analysis can be
used to identify important associations. The strengths and limitations of these techniques are
discussed and the techniques are demonstrated using a dataset from a recent study of risk factors
for pneumonia in swine. Emphasis is placed on comparing correspondence analysis with other
techniques as it has been used less in the epidemiology literature.

Keywords: Epidemiologic methods; Multivariable analysis; Multicollinearity; Correspondence analysis

1. Introduction and description of the problems

One of the objectives of veterinary epidemiology is the identification and quantification
of risk factors for animal health and production problems (Martin et al., 1987).
These risk factors include specific etiologic agents, host factors and environmental
factors. Determining which factors are important is a difficult task and this problem
becomes particularly acute in investigations of herd level risk factors for health and
production problems in livestock. In many of these investigations, data on numerous
factors relating to the management of livestock are collected and it becomes a form-
idable task to sort through the large number of independent variables to determine which
factors play an important role in the health or production problem investigated. (In order
to simplify the terminology, the term ‘disease’ will be used throughout this paper to
refer to all types of health and production problems which may be under investigation.)
The problem may be compounded by the fact that the data may have been collected
on relatively few farms since the time and effort required to collect accurate data may be
extensive. This will result in a dataset with relatively few observations but many
independent variables. While this paper will focus on the analysis of herd level data, the
issues discussed may be of equal concern in an investigation of risk factors for diseases
when most data are collected at the individual animal level.
The problems to be dealt with when analysing herd level data sets with many
independent variables fall into five main areas. A major problem, and the prime focus of
this paper, is that of multicollinearity. This arises when many of the factors studied are
closely related to other factors under investigation (i.e. they are highly correlated) and
separating their effects becomes difficult. A second, but related problem, is that of
confounding. Confounding arises when a factor is causally related to a predictor of
interest and to the outcome of interest. Data on the confounding variable may, or may
not, have been collected in the study. While the techniques discussed in this paper will
focus on the problem of multicollinearity, some of them may also contribute to solving
the problem of confounding.
There are three other problems that must be considered in the analysis of herd level
data.
1. The first relates to the sample size and the power of the study. A study may be too
small to detect important associations or conversely, it may be too large with
meaningless associations being declared significant.
2. The second problem is a function of the number of factors being investigated. With
multiple factors being considered, the possibility of finding associations ‘due to
chance alone’ goes up substantially.
3. The third problem is that the effects of one factor may vary as the level of another
factor changes (i.e. interaction).
While it is important to consider each of these five problems in a herd-level study,
discussion of them all is beyond the scope of this paper. However, the potential methods
of dealing with multicollinearity that are presented in this paper, may also help reduce
the effects of some of the other problems mentioned above.
Multicollinearity occurs when predictor (independent) variables are not statistically
independent (i.e. they are related to each other). It may be as simple as two independent
variables being highly correlated or a linear combination of a set of independent
variables may be highly correlated with another predictor (Glantz and Slinker, 1990).
Multicollinearity results in:
unstable estimates of regression coefficients in linear and logistic regression models;
incorrect variance estimates for the coefficients of those parameters in regression
models (leading to inflated standard errors and loss of statistical power);
difficulties in the numerical calculations involved in fitting the regression model
(when the problem is severe).
Multicollinearity may arise from two sources. Sample-based multicollinearity arises
from the inclusion of correlated predictor variables (e.g. type of flooring and type of
bedding used in livestock housing may be highly correlated). This is the most common
source of multicollinearity in epidemiologic studies. Structural multicollinearity arises
from the creation of correlated variables by adding power terms (e.g. quadratic terms) or
interaction terms to the regression model. This can usually be dealt with by ‘centring’
the variables of interest (i.e. subtracting the mean from the variable) before computing
the pow’er or interaction terms (Glantz and Slinker, 1990). For example, if x and x2 are
highly correlated but you want them both in the model, use (x - X) and (x - X)2
instead. Another approach is to replace the pair of variables with a single transformed
variable (e.g. In x may do as well as the combination of x and x2>. The problem of
structural multicollinearity will not be discussed further in this paper so multicollinearity
will refer to sample-based multicollinearity.
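As a brief illustration (not taken from the original study), the following Python sketch simulates a hypothetical predictor and shows how centring removes the structural correlation between the variable and its square; the variable names and simulated values are invented for the example.

```python
# Illustration (not from the original study): centring a predictor before
# squaring it removes the structural correlation between x and x^2; the
# simulated values are invented for the example.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=5.0, size=200)        # hypothetical predictor

print(np.corrcoef(x, x ** 2)[0, 1])                  # close to 1: collinear

x_centred = x - x.mean()
print(np.corrcoef(x_centred, x_centred ** 2)[0, 1])  # close to 0 after centring
```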

2. Potential solutions
There are two complementary approaches to dealing with the problem of multi-
collinearity.
1. The first is to reduce the number of independent variables prior to investigating
associations with the disease. This can be accomplished by screening for multi-
collinearity and selecting variables, creating scores or indices which combine data
from several independent variables, or using multivariable methods such as principal
component analysis or factor analysis which summarise the information contained in
the original independent variables into a smaller set of variables.
2. The second approach is to use appropriate statistical techniques to fully investigate
potential associations between the independent and dependent variables. These
techniques include simple statistical tests for screening associations, and multivari-
able techniques to simultaneously investigate all possible associations between the
dependent variable and the independent variables.

3. Reducing the number of independent variables

3. I. Correlation analysis

One strategy for reducing the number of independent variables is to screen potential
predictor variables using simple (unconditional) statistics and then select a subset of
independent variables for inclusion in the final analysis. Simple correlation analyses are
used to determine if any pairs of predictor variables are highly correlated and therefore
likely to result in multicollinearity. If any such pairs are found, one of the predictor
variables is selected for inclusion in the final analysis and the other is ignored. There are
several limitations to this approach. First, selecting the level of the correlation coeffi-
cient that represents a problem is arbitrary. While multicollinearity is almost certain to
be a problem with correlation coefficients over 0.9, it may occur at lower levels. Second,
the choice of which independent variable is to be removed is arbitrary, and the
investigator must use their knowledge of the production system to make an appropriate
decision. Finally, multicollinearity can arise if any linear combination of independent
variables is correlated with any other linear combination. Thus, examining variables in a
pairwise manner will not necessarily remove all sources of multicollinearity.
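A minimal sketch of such a pairwise screen is given below; it assumes the independent variables are held in a pandas DataFrame (here called predictors, a hypothetical name) and uses the 0.9 cut-off mentioned above, although any threshold remains an arbitrary choice.

```python
# A minimal sketch of the pairwise screen described above; 'predictors' is
# assumed to be a pandas DataFrame of independent variables, and 0.9 is the
# (arbitrary) cut-off mentioned in the text.
import pandas as pd

def flag_collinear_pairs(predictors: pd.DataFrame, cutoff: float = 0.9):
    """Return the pairs of predictors whose absolute correlation exceeds the cut-off."""
    corr = predictors.corr().abs()
    cols = corr.columns
    return [(cols[i], cols[j], corr.iloc[i, j])
            for i in range(len(cols))
            for j in range(i + 1, len(cols))
            if corr.iloc[i, j] > cutoff]

# Which member of a flagged pair to drop is left to the investigator's
# knowledge of the production system, as noted above.
```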

3.2. Herd management indices

Several studies have used herd management indices to combine data from multiple
independent variables into a single variable that represents the level of a group of factors
in a herd. For example, James (1991) used data on record keeping practices, cattle
handling facilities and herd management practice to compute an index of adoption of
health management practices in beef cow-calf herds. Mohammed (1990) developed an
index of poultry herd hygiene from data collected in a large number of poultry herds and
subsequently evaluated the ability of the index to predict a herd’s infection (Mycoplasma
gaflisepticum) status. The advantages of this approach are two-fold. First, the creation of
the score or index is totally under the control of the investigator who can use their
knowledge of the livestock industry to design the scoring system. It may be based on an
investigator’s perception of the area being studied (James, 1991) or on the analysis of
data collected from study herds (Mohammed, 1990). Second, the scoring system can
subsequently be applied to other farms not included in the original study. The disadvan-
tages are that the method of creating the scores is arbitrary and the use of an index in
analyses of risk factors precludes the evaluation of the effects of individual factors
which make up the score. This may make it difficult to make recommendations about
changes in management that will alter the risk of the disease being studied.
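The sketch below illustrates the general idea of such an index with invented practice variables and investigator-chosen weights; it is not the scoring system of James (1991) or Mohammed (1990), only a hedged example of how a score of this kind might be assembled.

```python
# Purely illustrative: an additive management index built from invented
# binary practice variables with investigator-chosen weights. It is not the
# scoring system of James (1991) or Mohammed (1990), only an example of the
# general idea (and of the arbitrariness noted above).
import pandas as pd

practices = pd.DataFrame({
    "keeps_records":           [1, 0, 1],   # 1 = practice adopted, 0 = not
    "has_handling_facilities": [1, 1, 0],
    "vaccinates_herd":         [0, 1, 1],
})
weights = {"keeps_records": 2, "has_handling_facilities": 1, "vaccinates_herd": 1}

herd_index = sum(practices[col] * w for col, w in weights.items())
print(herd_index)   # one score per herd, usable as a single predictor
```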

3.3. Principal components analysis

There are two closely related statistical techniques which can be used to consolidate
the information contained in all of the predictor variables into a new set of uncorrelated
(i.e. orthogonal) predictor variables. The first, principal components analysis, has
recently been reviewed in detail (Lafi and Kaneene, 1992) in the veterinary epidemiol-
ogy literature. If a study has k predictor variables, principal components analysis will
create k new predictor variables called principal components. These new variables are
uncorrelated. The computational technique automatically orders the components so that
each successive component contains a decreasing proportion of the total variation among
the independent variables. Consequently, the first principal component contains the
largest amount of information from the collection of original predictor variables while
the last may contain very little additional information.
Once principal components have been created, it is common to select a subset to use
as predictors in a multivariable analysis such as linear or logistic regression. The number
of principal components selected is up to the investigator and in some cases the full
complement is used. One commonly used rule of thumb is to select components that are
associated with eigenvalues greater than 1. (An eigenvalue is a measure of the amount of
variation among the predictor variables that is accounted for by a principal component.)
Once the regression model has been fit and regression coefficients obtained for each of
the selected principal components, matrix algebra can be used to convert these coeffi-
cients back into regression coefficients for the original independent variables. These
latter coefficients are more stable than those obtained from the regression based on the
original independent variables. There are two major limitations to principal components
analysis. First, principal components are merely mathematical constructs that have no
intrinsic meaning and consequently, it is impossible to interpret the regression coeffi-
cients of the principal components. Second, deciding how many principal components to
include in a regression analysis is an arbitrary process.
Conversion of the principal components coefficients back into coefficients for the
original predictor variables does provide values which can be interpreted. However, this
process results in the calculation of coefficients for all k predictor variables so there has
been no selection of predictors as being important. Nor is there any test of significance
for each of the coefficients. One approach to evaluating the ‘importance’ of these
coefficients is to standardise them in order to determine which independent variables
produce the largest change in the dependent variable when the independent variable
changes by one standard deviation. While this will provide an estimate of the importance
of each independent variable, it does not indicate whether the variable is a statistically
significant predictor. Also, the standardisation is based on the estimate of the standard
deviation of the predictor variable and this may vary across populations.
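The following numpy sketch outlines one way these steps could be carried out; the inputs X and y, and the eigenvalue cut-off, are assumptions for the example, and the code is an illustration of principal components regression rather than the exact procedure used in the studies cited in this paper.

```python
# A minimal numpy sketch (not the authors' code) of principal components
# regression: PCA on the correlation matrix, an eigenvalue cut-off, a
# regression on the retained component scores, and conversion of the
# component coefficients back to (standardised) coefficients for the
# original predictors. X (n herds x k predictors) and y are assumed inputs.
import numpy as np

def pc_regression(X: np.ndarray, y: np.ndarray, eig_cutoff: float = 1.0):
    # Standardise so that the analysis is based on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    keep = eigvals > eig_cutoff                  # e.g. the 'eigenvalue > 1' rule of thumb
    scores = Z @ eigvecs[:, keep]                # uncorrelated component scores

    # Ordinary least squares of y on an intercept plus the retained components.
    design = np.column_stack([np.ones(len(y)), scores])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)

    # Back-conversion: coefficients per standard-deviation change in each
    # original predictor (useful for ranking 'importance', but carrying no
    # significance test, as noted in the text).
    beta_std = eigvecs[:, keep] @ b[1:]
    return eigvals, beta_std
```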

3.4. Factor analysis

The closely related technique of factor analysis is based on the assumption that a set
of factors which do have an inherent meaning of their own can be computed as weighted
sums of the original variables. For example, in the original study from which the data
used in this paper came, Hurnik et al. (1994a) and Hurnik et al. (1994b) used factor
analysis to create six factors which they claimed represented types of farms in Prince
Edward Island. They suggested that the six factors represented farms with ‘extensive
housing’, ‘group pig management’, ‘room pig management’, ‘multiple source finishing
farms’, ‘floor-fed finishing farms’ and ‘integrated farms’. Sieber et al. (1987) used
factor analysis to combine data on linear type scores for dairy cows into factors
representing different types of cows (e.g. ‘big, strong cow’, ‘cow with high wide rear
udder’). The process of assigning meaning to these mathematically constructed variables
is called reification and may or may not be justified. Deciding whether the new variables
(factors) have physical reality is a subjective procedure.
Mathematically, the fundamental difference between principal components analysis
and factor analysis lies in the structure of the underlying model (Chatfield and Collins,
1980). In principal components analysis, the original variables (x_i) are assumed to be a
linear combination of the principal components (c_j) with weights (a_ij) (these are also
called component scores):

$$x_i = a_{i1} c_1 + a_{i2} c_2 + \cdots + a_{ik} c_k$$

The number of principal components (k) equals the number of original variables. If
only a subset of the principal components are used in subsequent analyses, it is assumed
that the weights associated with the other components equal zero. The weights associ-
ated with the components in the subset do not change as the number of components
selected changes.
In factor analysis, the original variables (x_i) are assumed to be a linear combination
of the factors (f_j) with weights (α_ij) (these are also called factor loadings) plus an error
term (ε_i):

$$x_i = \alpha_{i1} f_1 + \alpha_{i2} f_2 + \cdots + \alpha_{ij} f_j + \epsilon_i$$

The number of factors (j) is less than the number of original variables (k) and
unaccounted for variation is assigned to the error term (also called the specific factor).
There are two major drawbacks to factor analysis. First, the factor loadings change if the
investigator decides to recompute the analysis with a different number of factors. (This
does not happen in principal component analysis; the components are fixed.) Second, the
incorporation of the error term means that it is very difficult to compute factor scores for
a new observation, and hence do any model validation. Consequently, factor analysis
should only be used if there is good evidence that the underlying factors really do exist
and can be computed from the data.
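For illustration only, the sketch below fits a factor analysis model with scikit-learn's FactorAnalysis; the input array X, the choice of six factors and the absence of rotation are assumptions for the example and not the settings used by Hurnik et al. (1994a), and naming (reifying) the resulting factors remains a subjective, manual step.

```python
# Illustration only: exploratory factor analysis with scikit-learn's
# FactorAnalysis. The array X, the choice of six factors and the absence of
# rotation are assumptions for the sketch, not the settings used in the
# studies cited above.
import numpy as np
from sklearn.decomposition import FactorAnalysis

def fit_factors(X: np.ndarray, n_factors: int = 6):
    # Standardise so that loadings are comparable across variables.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    fa = FactorAnalysis(n_components=n_factors, random_state=0)
    scores = fa.fit_transform(Z)        # factor scores for each herd
    loadings = fa.components_.T         # (k predictors x n_factors) loadings
    return loadings, scores

# Unlike principal components, refitting with a different n_factors changes
# the loadings, which is one of the drawbacks discussed above.
```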
Even if it is accepted that the factors do represent underlying, unmeasured entities, it
may be desirable to extrapolate the findings back to the original predictor variables since
it is not possible to completely change a farm type in order to modify the risk of a
disease. For example, Hurnik et al. (1994b) noted that one type of farm (‘multiple
source finishing farms’) had a higher risk of having pneumonia (odds ratio 2.4). They
hypothesised that three specific risk factors that contributed heavily to this factor
(buying pigs from multiple sources, not being a farrow-to-finish farm and not having
routine veterinary visits) were probably associated with the increased risk of pneumonia.
However, this can only be done in a subjective way by examining the factor loadings
since the error term in the factor analysis model makes it very difficult to compute
coefficients for the original variables (as was done in principal components analysis).
A slightly different application of factor analysis is to use the technique as a way of
combining the information from a group of related independent variables (e.g. variables
dealing with flooring) into a single or small number of factors. This is analogous to the
creation of indices or scores for areas of farm management that was discussed above.

The advantages of this approach are that it avoids any a priori weighting of variables in
the creation of scores or indices and, secondly, that if more than one factor (score) is
selected within a management area, the factors will be uncorrelated. Factors within a
management area (e.g. flooring) may also be easier to interpret than factors based on the whole range
of independent variables. On the other hand, while multiple factors within a manage-
ment area (e.g. flooring) will be uncorrelated, these factors may still be correlated with
factors from other areas (e.g. ventilation) and the problem of multicollinearity may still
exist.

4. Steps and techniques to investigate associations

4.1. Screening simple associations

A frequently used preliminary step in the analysis of data from herd level studies is to
evaluate the magnitude of the association between each independent (predictor) variable
and the outcome of interest using simple statistics such as the t-test, χ² statistic or
simple linear regression. Only variables with a significant unconditional association with
the dependent variable are then included in the subsequent multivariable analyses,
although a quite liberal level of significance (e.g. 0.1 or 0.2) is often chosen as the
cut-off. A disadvantage of this approach is that important predictor variables may be
excluded if their effect is masked by another variable. It is also impossible to consider
interaction effects in this original screening so they may go undetected. Finally, although
this approach reduces the number of independent variables in the final analysis, it does
not completely eliminate the problem of multicollinearity as selected variables may still
be correlated.
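A hedged sketch of this screening step is shown below; it assumes a pandas DataFrame df with a binary 0/1 outcome column called disease (both hypothetical names) and applies a t-test to continuous predictors and a χ² test to categorical ones, with a liberal cut-off as discussed above.

```python
# A sketch of the unconditional screening step, assuming a pandas DataFrame
# 'df' with a binary 0/1 outcome column named 'disease' (both hypothetical
# names). Continuous predictors get a t-test, categorical ones a chi-square
# test, and a liberal cut-off (P <= 0.2 here) is used, as discussed above.
import pandas as pd
from scipy import stats

def screen_predictors(df: pd.DataFrame, outcome: str = "disease", alpha: float = 0.2):
    keep = []
    for col in df.columns.drop(outcome):
        if pd.api.types.is_numeric_dtype(df[col]) and df[col].nunique() > 2:
            g0 = df.loc[df[outcome] == 0, col]
            g1 = df.loc[df[outcome] == 1, col]
            p = stats.ttest_ind(g0, g1, equal_var=False).pvalue
        else:
            p = stats.chi2_contingency(pd.crosstab(df[col], df[outcome]))[1]
        if p <= alpha:
            keep.append((col, p))
    return keep    # variables carried forward to the multivariable analysis
```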

4.2. Linear and logistic regression

Multivariable techniques such as linear regression and logistic regression are now the
most commonly used analyses in veterinary epidemiologic studies. Sophisticated soft-
ware is now available to carry out linear and logistic regression analyses, which
automatically selects important predictor variables for inclusion in the model through
some form of forward, backward or stepwise selection process. While intuitively
appealing, there are serious pitfalls to this approach. Both regression techniques assume
independence among the independent variables and consequently, if any multicollinear-
ity is present, different model building strategies may produce very different results.
When building a regression model, the investigator usually endeavours to find the
most simple (parsimonious) model which adequately describes the data. There are two
important issues to be considered in this process. First, there are many different ways of
deciding what constitutes the ‘best’ model (e.g. r², adjusted r², Mallows' Cp, etc.).
Second, there are many model building strategies. These include sequential selection
procedures (e.g. forward selection, backwards elimination and stepwise), and more
complex methods requiring greater input from the investigator such as the method for
evaluating both interaction and main effects described in Kleinbaum et al. (1982). A
discussion of these issues is beyond the scope of this paper but it must be noted that
different model building strategies may result in very different final models. The reader
is referred to textbooks which deal with many of the issues in linear (Kleinbaum et al.,
1988; Glantz and Slinker, 1990) and logistic (Hosmer and Lemeshow, 1989; Collett,
1991) regression for a complete discussion of these issues.
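The sketch below shows one possible forward-selection strategy for a logistic model using statsmodels; the DataFrame df and the 0.05 entry criterion are assumptions for the example, and, as stressed above, other strategies (backward, stepwise, different criteria) may yield quite different final models.

```python
# A hedged sketch of one forward-selection strategy for a logistic model,
# using statsmodels; 'df' and the 0.05 entry criterion are assumptions, and
# different strategies may well give different final models, as noted above.
import pandas as pd
import statsmodels.api as sm

def forward_logistic(df: pd.DataFrame, outcome: str = "disease", enter_p: float = 0.05):
    y = df[outcome]
    remaining = [c for c in df.columns if c != outcome]
    selected = []
    while remaining:
        # Wald P-value of each candidate when added to the current model.
        pvals = {}
        for cand in remaining:
            X = sm.add_constant(df[selected + [cand]])
            fit = sm.Logit(y, X).fit(disp=0)
            pvals[cand] = fit.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] > enter_p:        # no candidate meets the entry criterion
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```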

4.3. Correspondence analysis

Correspondence analysis is a form of exploratory data analysis that has been developed
to analyse the complex relationships among qualitative variables (Lebart et al., 1984).
Used in ecological and sociological studies (Moles, 1990), it has been applied
to animal epidemiology since the 1970s (Madec and Tillon, 1988). Correspondence
analysis can be considered a dual principal components analysis of a contingency table
(Hoffman and Franke, 1986). “Multiple correspondence analysis involves projection of
the n-dimensional cloud of points representing variable modalities (category combina-
tions) in a subspace with fewer dimensions, composed of a number of mutually
orthogonal factorial axes. Construction and selection of the factorial axes is performed
so as to keep in the projection as much of the inertia, or variability, as possible of the
complete data set.” (Levenstein et al., 1992) The main objective of correspondence
analysis is to summarise the associations among a set of categorical variables in a small
number of dimensions, and to give a low-dimensional (often a two-dimensional)
graphical representation of these associations.
Correspondence analysis deals only with categorical variables so it avoids any
assumptions about the distributions of the variables. The use of variables measured on a
categorical scale also allows the investigation of non-linear relationships between
dependent and independent variables and the technique is quite robust (Fourichon et al.,
1991). However, quantitative (interval) data have to be categorised and the choice of the
cut-off points is arbitrary and may affect the results. Furthermore, even though corre-
spondence analysis has been classified in this paper as a technique for evaluating
associations with the dependent variables, by itself it is not sufficient to complete those
evaluations. It does not quantify the effect of risk factors on the disease nor does it allow
for the detection and evaluation of confounding or interaction effects. For that reason, it
should be used in conjunction with other unconditional and multivariable analyses (e.g.
logistic regression) (Ducrot et al., 1994).
The first step in correspondence analysis is the structuring of the data into a
contingency table, in which the columns and the rows are defined by all possible
combinations of levels of the variables. The cell values in the table represent the number
of observations with that set of row and column characteristics. Table 1 shows a simple
hypothetical example of a sample dataset with 100 observations, two independent
variables (fan ventilation and herd size) and a trichotomous dependent variable (disease).
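A minimal sketch of this first step is given below, using hypothetical column names that mirror Table 1; each block of the contingency table is simply a cross-tabulation of the outcome against one categorical predictor.

```python
# A minimal sketch of this first step, with hypothetical column names that
# mirror Table 1; each block of the contingency table is a cross-tabulation
# of the outcome against one categorical predictor.
import pandas as pd

records = pd.DataFrame({
    "disease":   [1, 3, 1, 2],     # 1 = negative, 2 = mild, 3 = severe
    "fan":       [0, 0, 1, 1],     # 0 = no fan ventilation, 1 = fan ventilation
    "herd_size": [2, 2, 1, 3],     # 1 = small, 2 = medium, 3 = large
})

contingency = pd.concat(
    [pd.crosstab(records["disease"], records[var]) for var in ("fan", "herd_size")],
    axis=1, keys=["fan", "herd_size"])
print(contingency)
```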
The investigator then determines which variables will be used to construct the
factorial axes and these are referred to as ‘active’ variables. In an analytical study, it is
usual to choose the independent variables as ‘active variables’. Factorial axes are then
computed so as to keep the distances between modalities (category combinations) of

Table 1
Hypothetical data showing the original data structure and the conversion of the data into a contingency table as
the first step in a correspondence analysis

Original data (one record per observation)

Observation no.   Disease (1, 2 or 3)   Fan ventilation (0 or 1)   Herd size (1, 2 or 3)
1                 1                     0                          2
2                 3                     0                          2
3                 1                     1                          1
...               ...                   ...                        ...
100               2                     1                          3

Contingency table arrangement of data for correspondence analysis

                 Fan ventilation          Herd size
Disease          0 (No)     1 (Yes)       1 (Small)   2 (Medium)   3 (Large)
1 (Negative)     12         16            8           12           8
2 (Mild)         3          21            5           12           7
3 (Severe)       28         20            12          20           16

‘active’ variables in the projection as close as possible to distances in the original
n-dimensional space. The first calculated axis takes as much of the inertia (variability) in
the n-dimensional space as is possible in one dimension. The second axis is orthogonal
to the first and explains as much of the remaining inertia, and so on. The proportion of
inertia explained by each factorial axis is determined and those with a value over an
investigator-chosen threshold are retained. The participation (‘loading’) of each active
variable in the inertia of each factorial axis is calculated in order to interpret the
information summarised by each axis.
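The sketch below illustrates the underlying computation for a simple (two-way) correspondence analysis via a singular value decomposition of the standardised residuals of a contingency table; it is not the ADDAD implementation used later in this paper, and the extension to multiple correspondence analysis is only indicated in the closing comments.

```python
# A hedged sketch (not the ADDAD implementation used later in this paper) of
# simple correspondence analysis of a two-way contingency table, via a
# singular value decomposition of the standardised residuals; 'table' is
# assumed to be a 2-D numpy array of counts.
import numpy as np

def correspondence_analysis(table: np.ndarray):
    P = table / table.sum()                         # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)             # row and column masses
    S = np.diag(r ** -0.5) @ (P - np.outer(r, c)) @ np.diag(c ** -0.5)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    inertia = sv ** 2                               # inertia carried by each axis
    row_coords = (U * sv) / np.sqrt(r)[:, None]     # principal coordinates (rows)
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]  # principal coordinates (columns)
    return inertia / inertia.sum(), row_coords, col_coords

# The first two columns of the coordinates give the low-dimensional joint
# display described below; illustrative (passive) categories can then be
# projected onto the same axes. Applying the same function to the 0/1
# indicator (disjunctive) matrix of the active variables gives, up to
# scaling, a multiple correspondence analysis whose row coordinates locate
# the individual herds on the factorial axes.
```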
The investigator can then plot new variables, called ‘illustrative’ variables onto the
projection. These illustrative variables are ‘passive’ in that they do not participate in
defining the factorial axes, but their relationships with factorial axes can be analysed.
These are usually the dependent variables in an analytical study. Passive and active
(dependent and independent) variables are plotted on a joint display (Fig. 1) to show the
relationships among all of the variables. Consequently, the independent variables are
used tcl build the factorial axes, but the dependent variables are plotted on the same axes.
The statistical significance of the links between the disease (illustrative variable) and the
factors (created with the active variables) can be assessed using a squared cosine statistic
(Hoffman and Franke, 1986). This helps determine which factors and original variables
are most closely associated with the disease.
Once the joint display of the spatial variation has been completed, the factorial
coordinates of the individual observations are often used in some form of cluster
analysis (e.g. hierarchical clustering; Roux, 1991) to complete the analysis. This process
identifies groups, or clusters of observations (farms) which have similar risk factor
profiles.
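As a brief illustrative sketch of this final step, the code below cuts a hierarchical clustering of the farms' factorial coordinates into a fixed number of groups; the input farm_coords, the use of Ward linkage and the choice of four clusters are assumptions for the example, not necessarily the settings used in the analyses reported below.

```python
# Illustrative sketch of the final clustering step: 'farm_coords' is assumed
# to hold factorial coordinates for the individual farms (e.g. row
# coordinates from a multiple correspondence analysis), and Ward linkage
# with four clusters is an assumption, not necessarily the method used here.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_farms(farm_coords: np.ndarray, n_axes: int = 2, n_clusters: int = 4):
    tree = linkage(farm_coords[:, :n_axes], method="ward")
    # Cut the dendrogram into a fixed number of groups of farms with
    # similar risk-factor profiles (cf. the four clusters of Table 7).
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```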
[Figure 1 appears here: a joint display of the first two factorial axes, with category points for the ten active
variables (herd size, stocking density, air volume per pig, floor feeding, dry feed, number of sources of pigs,
mixing of groups, minimal disease pigs, neighbour visits and years of experience) and for the three classes of
pneumonia prevalence.]

Fig. 1. Plot of correspondence analysis results. Both predictor variables and the outcome variable (prevalence
of pneumonia divided into three categories) are plotted on the first two dimensions (which load 28.8% of the
total inertia of the n-dimensional space) of the correspondence analysis.

5. Example: risk factors for pneumonia in swine

The data for this example were taken from the study carried out by Hurnik et al.
(1994a) and Hurnik et al. (1994b) which investigated a wide range of risk factors for
both pneumonia and pleuritis in hog farms in Prince Edward Island. Complete data were
available from 69 farms and a total of 43 independent variables dealing with herd
demographics, housing, ventilation, feed, management procedures, exposure to person-
nel and labour were included in the study (Table 2). Data from the abattoir inspection of
slaughter hogs were extracted from the APHIN (Dohoo, 1988) database and the
prevalence of pneumonia on each farm determined. The objective of this section is to
present highlights of the results from each of the analytical techniques presented earlier
in this paper. However, since data for the creation of herd management indices (e.g. a
barn sanitation index) were not collected in the original study, these will not be included
in the results presented below.

5.1. Correlation analysis

A total of 861 pairwise correlation coefficients among the independent variables were
examined, the highest numerically being 0.78 for the correlation between the densities of

Table 2
Variables used in the analysis of factors affecting the frequency of swine pneumonia in Prince Edward Island
swine herds. Data collected from 69 herds

Variable    Description
size        Number of pigs sold per year
growth      Average daily gain
cmpfd       Complete feed fed
suppl       Supplement added to feed
prmx        Premix added to feed
strmed      Medicated starter feed used
selenium    Selenium added to feed
dryfd       Feed fed dry (vs. wet)
flrfd       Pigs fed on floor
rooms       Number of separate rooms in barn
m3pig       Air volume per pig (m³)
shipm2      Density (pigs shipped m⁻²)
exhaust     Exhaust fan capacity (propn. of recommendation)
inlet       Air inlet size (propn. of recommendation)
maninlt     Manual adjustment of inlets
mixmnr      Manure mixed between pens
straw       Straw bedding used
washpns     Frequency of pen washing (year⁻¹)
you         Owner works in barn
family      Family works in barn
hrdhlp      Hired help works in barn
exprnce     Years of experience raising hogs
strdnst     Floor space, starter hogs (m²)
grwdnst     Floor space, growing hogs (m²)
fnrdnst     Floor space, finishing hogs (m²)
lqdmnr      Manure handled as liquid
floor       Type of flooring
sldprtn     Solid partitions between pens
opnprtn     Open partitions between pens
hlfsld      Half solid partitions between pens
pigwtr      Pigs per water nipple
numpen      Number of pens
mixgrp      Pigs from multiple groups mixed
hldbck      Slow growing pigs held back from slaughter
dstfrm      Distance (km) to nearest hog farm
hmrsd       Home raised pigs kept on farm
nmbsrc      Number of sources of pigs
mnlds       Only minimal disease pigs
vet         Vet. visits per year
feedsls     Feed salesman visits per year
neighbr     Neighbour visits per year
pigprdc     Pig producer visits per year
trucker     Trucker visits per year
Table 3
Correlation matrix of all independent variables for which at least one pairwise correlation exceeded 0.5

         size   lqdmnr  floor   suppl   prmx    dryfd   flrfd   straw   grwdnst fnrdnst sldprtn hlfsld  hmrsd   nmbsrc
size     1.00
lqdmnr   0.3*   1.00
floor    0.51*  0.46*   1.00
suppl    0.03  -0.36*   0.06    1.00
prmx     0.19   0.25    0.17   -0.55*   1.00
dryfd   -0.23  -0.27   -0.21   -0.06   -0.13    1.00
flrfd    0.35*  0.30    0.13    0.06    0.17   -0.55*   1.00
straw   -0.37* -0.75*  -0.36*   0.22   -0.02    0.21   -0.20    1.00
grwdnst -0.30  -0.27   -0.18   -0.04    0.08    0.08   -0.21    0.33*   1.00
fnrdnst -0.35* -0.30   -0.28   -0.13   -0.04    0.20   -0.19    0.28    0.78*   1.00
sldprtn -0.26  -0.35*  -0.25    0.13   -0.33*   0.34*  -0.20    0.10    0.11    0.14    1.00
hlfsld   0.45*  0.44*   0.46*  -0.15    0.21   -0.27    0.22   -0.23   -0.31*  -0.24   -0.70*   1.00
hmrsd   -0.01   0.04    0.15    0.11   -0.02    0.01    0.03   -0.28   -0.16   -0.06    0.00    0.06    1.00
nmbsrc   0.08   0.07    0.12   -0.09   -0.03   -0.09    0.03    0.03   -0.01   -0.08   -0.15    0.21   -0.57*   1.00

* P < 0.01.

Table 4
Results from principal components analysis of swine farm management data. Only factors with eigenvalues
greater than 2.0 were retained

Component   Eigenvalue   Cumulative      Component scores for the first three
                         proportion a    variables in the dataset
                                         size     growth   cmpfd
1           5.8          0.14            0.30     0.10    -0.12
2           3.2          0.21            0.13    -0.10    -0.24
3           3.0          0.28           -0.01    -0.14    -0.05
4           2.6          0.34            0.14    -0.16     0.18
5           2.4          0.40            0.12     0.13    -0.06
6           2.3          0.45            0.12    -0.04    -0.33
7           2.0          0.50            0.03     0.11     0.10

a Cumulative proportion of variation among the independent variables explained by the components.

grower pigs and finisher pigs. A similarly large (but negative) correlation (r = -0.75)
existed between the use of straw bedding and liquid manure handling. Eight coefficients
had absolute values greater than 0.5 and a further 24 were between 0.4 and 0.49 (Table
3). While none of these individual correlation coefficients indicated a definite problem
with multicollinearity, it was evident that there was some redundancy in the information
contained in the set of independent variables.

5.2. Principal components analysis

When principal components analysis was used to investigate relationships among the
independent variables, the first seven principal components had eigenvalues greater than
2 and together they explained 50% of the total variation among the independent
variables (Table 4). The remaining 36 principal components explained the other 50%.
Consequently, the first seven were selected for inclusion in subsequent regression
analyses (discussed below). The component scores for the first three variables in the
dataset (‘size’, ‘growth’ and ‘cmpfd’) are included in Table 4 for example purposes.
Similar scores are computed for each of the 43 independent variables. The value of each
farm’s principal component score is the linear combination of all these scores (the
eigenvector) multiplied by the standardised values of the independent variables for the
specific farm. These scores do not change as the number of principal components
retained is changed.

5.3. Factor analysis

The first six factors had eigenvalues greater than 2.0 and together they explained 56%
of the variability among farms in terms of the independent variables (Table 5). For each
factor, the three variables with the largest factor loadings are presented (Table 5). If only
these three loadings were considered in the definition of ‘farm type’, then factor 1 which
had large loadings on ‘liquid manure handling’, ‘partially solid pen partitions’ and

Table 5
Results from factor analysis of swine farm management data. Only factors with eigenvalues greater than 2.0
are presented

Factor   Eigenvalue   Cumulative      Highest factor loadings b
                      proportion a
1        5.65         0.18            lqdmnr (0.78), hlfsld (0.71), size (0.71)
2        2.99         0.27            exprnce (0.61), hldbck (0.58), mnlds (-0.50)
3        2.79         0.35            sldprtn (-0.69), opnprtn (0.54), hmrsd (-0.46)
4        2.40         0.43            opnprtn (-0.57), exhaust (0.49), mixgrp (0.58)
5        2.11         0.49            you (-0.47), nmbsrc (-0.37), strdnst (0.37)
6        2.02         0.56            you (0.48), cmpfd (-0.43), floor (0.39)

a Cumulative proportion of variation among the independent variables explained by the factors.
b The variable is given, followed by the loading in parentheses.

‘larger herd size’ might be considered as representing large, newer and more intensive
hog operations in Prince Edward Island.

5.4. Simple (unconditional) associations with the dependent variable

A summary of the results of the analyses of the simple (unconditional) associations
between the 43 independent variables and the presence or absence of pneumonia on the
farm is presented in Table 6. These results were obtained from t-tests or χ² analyses
depending on whether the independent variable had a continuous or a categorical
distribution. Twelve associations were significant at P ≤ 0.05 and another two
associations had P-values between 0.05 and 0.1.

5.5. Logistic regression

For comparative purposes, two stepwise logistic regression analyses were carried out.
In the first, all variables with unconditional associations (P ≤ 0.1) with the presence or
absence of pneumonia were made available to the model building process and stepwise
selection of variables employed. The variable ‘trucker’ (number of visits by truckers
during the year) was excluded from all logistic regressions since the variable had an
average of 0 among farms without pneumonia and the maximum likelihood estimation
procedure in the logistic regression analysis would not converge if the variable was
included. This analysis suggested that larger herds were more likely to have pneumonia
present but that raising some or all of your own weaner pigs and/or purchasing only
minimal disease stock reduced the risk.
Logistic regression was also carried out with principal component scores as the
independent variables. The principal component coefficients were subsequently con-

Table 6
Results of analyses evaluating risk factors for the presence/absence of pneumonia on hog farms in Prince
Edward Island. Unconditional associations were determined using t-tests and χ² analyses. Conditional
associations were determined using logistic regression based on the original variables or principal components

                                              Unconditional association      Conditional associations (logistic regression)
Independent variable                          P ≤ 0.05     0.05 ≤ P ≤ 0.1    Original variables:    Principal components:
                                                                             β (P-value)            β (rank a)
size (no. of pigs shipped year⁻¹)             Pos          -                 +0.002 (0.01)          0.0002 (6)
rooms (no. of separate rooms)                 Pos          -                 -                      -
dstfrm (dist. to nearest hog farm)            Neg          -                 -                      -
nmbsrc (no. of sources of pigs)               Pos          -                 -                      0.088 (1)
vet (no. of vet. visits year⁻¹)               Neg          -                 -                      -
pigprdc (no. of pig prod. visits year⁻¹)      Pos          -                 -                      -
trucker (no. of trucker visits year⁻¹)        Pos          -                 -                      -
exprnce (no. of years experience)             Pos          -                 -                      0.010 (4)
floor (pigs fed on floor)                     Pos          -                 -                      -
hldbck (slow pigs held back)                  Pos          -                 -                      0.282 (3)
hmrsd (some/all weaner pigs raised on farm)   Neg          -                 -1.36 (0.05)           -0.276 (2)
mnlds (only minimal disease pigs)             Neg          -                 -1.94 (0.01)           -
dryfd (pigs fed dry feed)                     -            Neg               -                      -
hrdhlp (hired help used in barn)              -            Pos               -                      -
cmpfd (complete feed fed)                     ns           -                 -                      -0.306 (5)

a Descending order based on standardised regression coefficient (i.e. rank (1) was the variable with the largest
standardised regression coefficient).

verted back to unstandardised coefficients for the original independent variables and the
six coefficients which had the largest standardised coefficients are presented. The main
difference between the results from the principal components regression and the
ordinary logistic regression was that the number of sources of pigs (nmbsrc) replaced
purchasing only minimal disease pigs (mnlds) as the most important predictor(s) of the
presence/absence of pneumonia. The relationship between the number of sources and
purchasing only minimal disease pigs would not have been detected looking at pairwise
correlations only.

5.6. Correspondence analysis

Correspondence analysis more clearly demonstrates associations when the categorical
dependent variable is coded in three or more levels. Consequently, the dependent
variable was coded 0 for very mild or no pneumonia (herd prevalence less than 0.1), 1
for mild levels of pneumonia (prevalence 0.1-0.39) and 2 for severe pneumonia
(prevalence 0.4 or higher). Correspondence analysis was applied to a selected range of
independent variables, using the ADDAD software (ADDAD, 1985). Initial variable
selection was performed so that variables not significantly linked to the prevalence of
pneumonia (χ² test with P > 0.2), and those that were linked to pneumonia through a
confounder were excluded from the analysis. As a result, ten active variables, with 25
modalities (category combinations) were used in the analysis. Results are presented in
Fig. 1. The first principal axis accounted for 14.5% of the spatial variation in the data
while the second accounted for 13.3%. If no relationships among the active variables
had existed in the dataset, each factorial axis would have loaded, on average, 6.7% of
the inertia.
The display shows a strong structural relationship among many of the variables. For
example, several factors cluster at the left side of the plot. Large herds (over 1400 pigs
year⁻¹) are associated with moderate density (2.7-3.5 pigs m⁻²) barns, minimal air
space per pig (less than 2 m³ per pig), floor feeding, not feeding dry food, buying pigs
from two to nine sources and having an experienced producer (over 20 years experience).

Table 7
Hierarchical clustering analysis showing the distribution of dependent and independent variables in four farm
clusters

Variable (coding/units)                              Farm cluster (no. of farms)
                                                     A (15)   B (21)   C (12)   D (21)
Prevalence of pneumonia (% of lungs examined)
  <10                                                9        14       3        2
  10-40                                              6        3        4        2
  ≥40                                                0        4        5        17
Herd size
  <700                                               15       3        1        2
  700-1400                                           0        13       6        7
  ≥1400                                              0        5        5        12
Air space per pig (m³)
  <2                                                 0        2        6        8
  2-3                                                1        19       5        9
  ≥3                                                 14       0        1        4
Pigs m⁻²
  <2.7                                               10       3        1        5
  2.7-3.5                                            1        9        0        11
  ≥3.5                                               4        9        11       5
Pigs fed on the floor
  No                                                 12       19       2        11
  Yes                                                3        2        10       10
Pigs fed dry feed
  No                                                 2        3        9        7
  Yes                                                13       18       3        14
Farmer's experience (years)
  1-10                                               5        16       1        1
  11-20                                              6        3        5        7
  >20                                                4        2        6        13
Farm produces some (or all) of own weaner pigs
  No                                                 4        8        0        17
  Yes                                                11       13       12       4
No. of sources of weaner pigs
  1                                                  11       15       10       8
  2-9                                                4        6        2        13
Farm purchases only minimal disease pigs
  No                                                 11       5        10       16
  Yes                                                4        16       2        5
Slow growing pigs held back from shipping
  No                                                 6        11       6        1
  Yes                                                9        10       6        20
No. of neighbour visits per year
  0                                                  7        7        8        2
  1-4                                                4        11       3        8
  5-50                                               4        3        1        11

These variables also appear to be linked to a high prevalence (40% or greater) of
pneumonia.
The results of the cluster analysis are presented in Table 7 with farms clustered into
one of four different groups (A-D). In general, the level of pneumonia increased from
group A to D, and the changes in farm characteristics that accompanied this increase in
prevalence can be seen in the table. This analysis shows similar associations to those in
Fig. 1. For example, farms in cluster D tended to have a high prevalence (40% or
greater) of pneumonia, were large (over 1400 pigs year⁻¹), had moderate floor space
but low air volume per pig, did not feed dry food but did floor feed, had older (more
experienced) owners, bought pigs from multiple sources and held slow growing pigs
back from slaughter. These were all factors that clustered at the left side of Fig. 1. On
the other hand, farms in cluster A were small herds with low pig density (less than 2.7
pigs m⁻²) and large air volumes per pig (3 m³ or greater). These herds had a low (under
10%) or moderate (10-40%) prevalence of pneumonia.
Correspondence analysis provides an overview of the complex relationships among
the dependent and independent variables and this is complementary to the assessment of
the effects of individual variables provided by the analytical modelling approach (Ducrot
and Cimarosti, 1991). In this example, the clustering of factors in Fig. 1 suggests that it
is unlikely that a producer would own a large herd and have large volumes of air per pig
(large herds were associated with low air volumes). The same relationship was evident
in the cluster analysis, large herds tended to be in cluster D while herds with large air
volumes were much more likely to be in cluster A. It follows that it is very
difficult to assess the effect of one factor (e.g. herd size) on pneumonia, with the other
factors being held constant. Even though no serious pairwise multicollinearity problem
(i.e. highly correlated pairs of variables) was noted in this dataset, there were clearly
some important relationships among the independent variables. This represents an
overall and diffuse multicollinearity problem.

6. Conclusions

There is no easy solution to the complex problem of analysing data from a large
number of independent variables. Identifying important associations requires the investi-
gator to have a thorough knowledge of the production system studied as well as the data
collected and probably requires the application of several of the techniques described in
this paper.
While the use of simple statistics to evaluate unconditional associations avoids the
problem of multicollinearity that plagues multivariable analyses, it suffers seriously from
the problem of ‘multiple comparisons’ and fails to consider relationships among the
independent variables. If one variable is found to be a risk factor, another highly
correlated variable will be as well, regardless of its biological association with the
dependent variable. Creating scores or indices to summarise the data enables the
investigator to incorporate their knowledge of the production system into the analysis
but they also have the potential to be arbitrary and biased. They also do not allow for the
evaluation of the role of the individual factors that go into making up the score.

Principal components analysis solves the problem of multicollinearity but it cannot
determine which individual predictor variables have significant associations with the
dependent variable. Factor analysis also solves the problem of multicollinearity but is
only appropriate if there is strong evidence that the proposed underlying factors do exist.
Multiple linear and logistic regression have the advantage that they will identify
individual risk factors that are significantly associated with the outcome of interest, but
they can be seriously adversely affected by the problem of multicollinearity.
Correspondence analysis may be used to complement the analytical procedures
described above. It helps in describing the complex relationships that exist among
variables (both independent and dependent), and produces a low-dimensional graphical repre-
sentation of the relationships. However, it does not assess the statistical significance of
the direct associations between specific independent variables and the dependent vari-
able.
In the future, it may be possible to design studies which investigate a much more
narrow range of risk factors and the problem of large numbers of independent variables
will be reduced. Until then, investigators will have to rely on using a variety of
techniques and integrating the results of their statistical analyses with their knowledge of
the production system being investigated. The choice of method will depend on the
hypothesis(es) tested. If a number of variables all relate to the same basic hypothesis
then the creation of scores or indices or the use of a technique such as factor analysis
may be appropriate. If the objective is to identify individual independent variables that
are associated with the outcome, then techniques which identify their specific effects and
which can take into consideration relationships among independent variables will have
to be used.

References

ADDAD, 1985. Manuel de référence, Version micro 85. M.O. Lebeaux (Editor), France.
Chatfield, C. and Collins, A.J., 1980. Introduction to Multivariate Analysis. Chapman and Hall, London, 246 pp.
Collett, D., 1991. Modelling Binary Data. Chapman and Hall, London, 369 pp.
Dohoo, I.R., 1988. Animal productivity and health information network. Can. Vet. J., 29: 281-287.
Ducrot, C. and Cimarosti, I., 1991. Complementary aspects of the logistic model and of the correspondence
analysis to investigate risk factors in animal pathology: application to the study of orf risk factors in sheep
breeding. In: S.W. Martin (Editor), Proc. 6th Int. Symp. Veterinary Epidemiology and Economics, Ottawa,
12-16 August 1991, pp. 97-100.
Ducrot, C., Cimarosti, I., Bugnard, F., van de Wiele, A. and Philipot, J.M., 1994. Risk factors for infertility in
nursing cows linked to calving. Vet. Res., 25 (2-3): 196-202.
Fourichon, C., Madec, F., Pansart, J.F. and Paboeuf, F., 1991. The influence of the choice of class limits on
the results of a factorial analysis of correspondence. In: S.W. Martin (Editor), Proc. 6th Int. Symp. on
Veterinary Epidemiology and Economics, Ottawa, 12-16 August 1991, pp. 397-399.
Glantz, S.A. and Slinker, B.K., 1990. Primer of Applied Regression and Analysis of Variance. McGraw-Hill,
New York, 777 pp.
Hoffman, D.L. and Franke, G.R., 1986. Correspondence analysis: graphical representation of categorical data
in marketing research. J. Marketing Res., 23: 213-227.
Hosmer, D.W. and Lemeshow, S., 1989. Applied Logistic Regression. John Wiley, New York, 307 pp.
Hurnik, D., Dohoo, I.R., Donald, A. and Robinson, N.P., 1994a. Factor analysis of swine farm management
practices on Prince Edward Island. Prev. Vet. Med., 20: 135-146.

Hurnik, D., Dohoo, I.R. and Bate, L.A., 1994b. Types of farm management as risk factors for swine
respiratory disease. Prev. Vet. Med., 20: 147-157.
James, C.L., 1991. Beef cow/calf productivity and farm management characteristics. M.Sc. Thesis, University
of P.E.I., Charlottetown, Canada.
Kleinbaum, D.G., Kupper, L.L. and Morgenstern, H., 1982. Epidemiologic Research-Principles and Quanti-
tative Methods. Lifetime Learning Publications, Belmont, CA, pp. 447-456.
Kleinbaum, D.G., Kupper, L.L. and Muller, K.E., 1988. Applied Regression Analysis and Other Multivariable
Methods. PWS-Kent Publishing, Boston, 718 pp.
Lafi, S.Q. and Kaneene, J.B., 1992. An explanation of the use of principal-component analysis to detect and
correct for multicollinearity. Prev. Vet. Med., 13: 261-275.
Lebart, L., Morineau, A. and Warwick, K.M., 1984. Multivariate Descriptive Statistical Analysis: Correspon-
dence Analysis and Related Techniques for Large Matrices. John Wiley, New York.
Levenstein, S., Prantera, C., Varvo, V., Spinella, S., Arca, M. and Bassi, O., 1992. Life events, personality and
physical risk factors in recent-onset duodenal ulcer-A preliminary study. J. Clin. Gastroenterol., 14 (3):
203-210.
Madec, F. and Tillon, J.P., 1988. Écopathologie et facteurs de risque en médecine vétérinaire - analyse
rétrospective (1977-1987) de l'expérience acquise en élevage porcin intensif. Rec. Med. Vet., 164 (8-9):
607-616.
Martin, S.W., Meek, A.H. and Willeberg, P., 1987. Veterinary Epidemiology, Principles and Methods. Iowa
State Press, Ames, 343 pp.
Mohammed, H.O., 1990. A multivariate indexing system for hygiene in relation to the risk of Mycoplasma
gallisepticum infection in chickens. Prev. Vet. Med., 9: 75-83.
Moles, A.M., 1990. Les sciences de l'imprécis. Seuil, Paris, 253 pp.
Roux, M., 1991. Basic procedures in hierarchical cluster analysis, interpretation of hierarchical clustering. In:
J. Devillers and W. Karcher (Editors), Applied Multivariate Analysis in SAR and Environmental Studies.
Kluwer, Dordrecht, pp. 115-152.
Sieber, M.A., Freeman, A.E. and Hinz, P.N., 1987. Factor analysis for evaluating relationships between first
lactation type scores and production data of Holstein dairy cows. J. Dairy Sci., 70: 1018-1026.
