283 views

Uploaded by gomoria

- Economics Books
- Regression
- Choosing Test
- Crop Insurance Paper
- Financial Analyst
- Impact of Facebook Usage on Students’ Involvement in Studies Doscussion- 15 Dec
- Research Article -HR Old
- Regression
- PO Models With Complex Survey Data
- PASSS RQ2 1Simple Logistic Regression - One Categorical Variable
- Oral Proficiency Standards & Teacher Candidate Standards Flan12030
- Manual Web
- CQE
- Logistic Regression
- 2.15 Exercise Time Travel Lite16
- 15_J_2348
- ersh8320f07_18[1]
- Political Efficacy and Introductory Political Science Course: Findings from a Survey of a Large Public University
- Ch 4 the Research Process an Overview (1)
- bigMNP

You are on page 1of 6

Mark Tranmer, University of Southampton, UK

D. Holt, Office for National Statistics, London, UK

David G. Steel, School of Mathematics and Applied Statistics,

University of Wollongong, NSW 2522, Australia (david_steel~uow.edu.au)

lacy, multilevel models, random effects, grouping variables as auxiliary variables they were able to

variables. reduce the aggregation effects in a set of ecological

correlations by up to 70 per cent.

1. Introduction In social research the variables are usually cat-

egorical at the individual level and the area level

An ecological analysis uses aggregate group level means are the corresponding proportions. Vari-

data to estimate individual level relationships. ous methods of ecological inference in this situa-

The ecological fallacy arises when ecological anal- tion were evaluated by Cleave, Brown and Payne

ysis provides biased estimates of individual level (1995). These included linear ecological regres-

relationships. The groups involved are often small sion as originally suggested by Goodman (1959)

geographic areas such as census Enumeration Dis- and a method based on an Aggregated Compound

tricts (EDs). Suppose that we are interested in in- Multinomial (ACM) model. The ACM model as-

vestigating the relationship between two variables sumed that the frequencies in each group have in-

Y and X. The aggregate data available for area g dependent multinomial distributions and used a

_

usually consists of the group level means Yg and Dirichlet compounding distribution. Their evalu-

ation favoured the ACM based method but they

The ecological fallacy arises because the indi- recognized that it is not easy to implement. Re-

viduals within the groups are not equivalent to cently King (1997) proposed a new method of eco-

randomly formed groups. Progress in understand- logical analysis for categorical data. This method

ing aggregation effects requires allowance for the models the conditional proportions of Y given X

population structure in the model underpinning as random effects with a joint truncated Normal

the analysis. Steel and Holt (1996) proposed a distribution and exploits the constraints implied

model for cases where the relationship between the by the group means.

two variables of interest, Y and X, is linear. This In this paper we develop and evaluate some sim-

model incorporated auxiliary variables, Z, which ple adjusted ecological analysis procedures based

explain much of the within area homogeniety of on the idea of incorporating auxiliary variables

the variables of interest. Random group level coef- and using individual level data available for these

ficients were also included to reflect group level ef- variables.

fects additional to those due to the auxiliary vari-

ables. Based on this model Steel and Holt (1996)

developed a method to adjust group level covari-

ance matrices using limited individual level data 0 Adjusted Ecological Analysis for

available for the auxiliary variables. Steel, Holt Dichotomous Variables

and Tranmer (1996) evaluated this approach for

estimating correlation coefficients using ED level Let Yi and Xi be the values of the variables of

data from the 1991 UK population census. Using interest for the i th individual in the population.

This research was supported by the UK Economic So- Suppose there is a sample, s, of groups and within

cial Research Council, grant Ft000 23 6135 the sampled group, g, a sample of ng individuals

324

is used to calculate sample group means: and King (1997). In both cases the covariates are

used at the area level to explain some of the vari-

1 ZYi and )(g= 1 ~}-~X, ation in the random coefficients characterizing the

relationship between the variables across the ar-

eas. Our approach is motivated by the idea that

The variables are dichotomous and so these means a large part of the variation in the relationships

are proportions. An important special case occurs across areas is due to compositional effects and

when the means are available for all areas and each can be removed by inclusion of the auxiliary vari-

mean is based on all individuals within the rele- ables. If this is the case then handling the remain-

vant area. ing variation between areas should be easier. We

For the sample in the group g let nabg be the propose attempting to average over the auxiliary

number of individuals for which Y~ = a and variables to estimate the marginal relationship be-

Xi = b. The corresponding population counts tween Y and X. This requires information about

are Nabg. We use " + " to indicate summa- the relevant parameters of the individual level dis-

tion over a subscript. Define Pablg = nabg//n++g tribution of the covariates.

and Pablg = Nabg/N++g as the sample and fi- For a single categorical auxiliary variable the

nite population proportions for group g. The con- information needed to calculate the adjusted

ditional proportions are Palbg = nabg//n+bg and marginal probabilities consists only of the propor-

Palbg = Nabg/N+bg. Define Pab+ -- nab+l/n+++ tions in each category. These can be calculated

and Pab+ = Nab+/N+++ as the overall sample and from the weighted group level data or could come

finite population proportions. The corresponding from some other source, such as a survey. If two or

conditional proportions are Palb+ = nab+/n+b+ more categorical auxiliary variables are used then

and Palb+ = Nab+ /N+b+. the marginal cross tabulations of these variables

The basis of the approach proposed by Steel and are required, unless further assumptions can be

Holt (1996) is that, for continuous variables, the made concerning the relationship between the aux-

conditional probability density function of Y given iliary variables. No individual level data about the

X can be expressed as variables of direct interest are used.

f f(xlz)f(z)dz (1) 3. Empirical Evaluation

When the parameters of f(ylx, z), f(zlz), and f(z)

Individual and group level data from the 1991 UK

are distinct, analysis can proceed by using individ-

population census were used to evaluate several

ual level data to estimate the parameters of f(z)

methods. The Small Area Statistics (SAS) data-

and aggregate data are used to estimate the pa-

base provided data in the form of totals for a range

rameters of f (ylx, z)and f (xlz).

of categorical variables for EDs. This was the

Assume t h a t there is a single auxiliary variable

source of the group means Yg, )fg and Zg. For the

Z and t h a t Y, X and Z are all categorical. The

variables analysed in this paper these means are

target of inference is the conditional probability

all based on 100 per cent of the census records for

distribution of Y given X, P(YIX). The approach

the relevant ED. Individual level data are available

we develop here is based on

from a 2 percent Sample of Anonymized Records

p ( y i x ) = F_,z P(YI X, Z ) P ( X I Z ) P ( Z ) (SAR) for Local Authority Districts (LADs). The

~-,z P ( X I Z ) P ( Z ) (2) evaluation used data for the LAD of Manchester,

which contained 897 EDs and 7613 individuals in

Estimation of P(YIX) can be a t t e m p t e d by us- the SAR, of which 5802 were aged 16 or more.

ing aggregate data to estimate P(YIX, Z) and Using these data it is possible to calculate ad-

P(XIZ ) and then using the individual level data justed ecological estimates of the marginal proba-

to estimate P(Z). The analysis using aggregate bility distribution P(YIX) based on equation (2)

data will be based on linear or logistic regression. using Yg, Xg and Zg obtained from the SAS data-

The potential benefit of using group level in- base and using individual level data obtained from

formation about covariates in ecological analysis the SAR concerning variables chosen as auxiliary

is discussed by Cleave, Brown and Payne (1995) variables. In this evaluation we considered the

325

following estimators of the conditional probability data to estimate P(YIX, Z) and P(XIZ). Es-

P(Y = I IX = b) = 7rllb, for b = 0, 1. timation of P(Y[X) is then based on equation

(2) with the individual level data being used

(a) SAR relative frequencies. The relative fre- to estimate P ( Z ) .

quency obtained from the SAR, nlb+/n+b+.

(g) Adjusted correlation approach. For two di-

(b) SAS relative frequencies. For pairs of vari- chotomous variables the correlation combined

ables for which the SAS contains the relevant with the marginal totals determines the pro-

cross tabulation we can calculate Nlb+ IN+b+. portions in the cross classification i.e.

Ply+ = PI++P+~+ +

(c) Ecological linear regression. This is built

around the relationship Ryxv/PI++(1 - P l + + ) P + l + ( 1 - P+l+) (4)

where R y x is the correlation coefficient based

~'g = P11og(1 - X . ) -4- P111g)(. (3) on the table of proportions P~b+. The method

proposed by Steel and Holt (1996) is used

If Pll0g and Pxllg are random variables with to produce adjusted estimates of the correla-

E[P~Io, IXg] - 7r~10 and E [ ~ I ~ , I X , ] _ - 7r11~ tion R y x ( Z ) which can then be substituted

then a linear regression of Yg on Xg gives into equation (4). This method enables use of

unbiased estimates of 7r110 and 7r111. This is information about several auxiliary variables

the classical Goodman regression approach even if only the two way cross tabulations are

(Goodman, 1959). This approach can pro- available.

duce estimates outside [0,1] and simple direct

use of this model is not usually recommended. (h) King's ecological inference method. This

method is also built around equation (3). The

(d) Ecological logistic regression. A common group level conditional proportions Pll0g and

model used in analysing a dichotomous re- Pl[lg are assumed to have a joint Normal dis-

sponse variable is logistic regression which as- tribution which is truncated so that they each

sumes Y~ is a Binomial variable based on one lie in the [0,1] interval. The method incorpo-

trial, B(1, E[Y~IXi]), where rates the fact that given ]?g and if g, equation

(3) implies bounds and constraints for Plllg

E[r, lx,] ) and Pll0g. This approach produces estimates

log 1 - E[Y~IX,] - ~ + DX~

of Plllg and Pll0g which can then be com-

bined to produce estimates of Plfl and/9110.

If groups were completely homogeneous with

This method can incorporate group level co-

respect to X then each group total Yg would

variates to help model the variation between

be a Binomial variable based on Ng trials with

groups.

probability such that

The variables used in the analysis were as fol-

E[~alXg] ) lows:

log 1 - -~[Ygg~ffg] - oL+ Z f( g

Y: employed, unemployed,

X" marital status,

Groups are not homogeneous with respect to

Z: age 45-59, age 60+, living in owner occupied

X but this model has the advantage of not

dwelling, renting from local authority.

giving predicted probabilities outside [0,1].

These auxiliary variables were chosen because of

(e) Adjusted ecological linear regression. Here we their success in removing aggregation effects in

use linear regression based on aggregate data correlations in the evaluation by Steel, Holt and

to estimate P(YIX, Z) and P(XIZ). Estima- Tranmer (1996). The analysis was confined to

tion of P ( Y I X ) i s then based on equation (2) those aged 16 or more.

with the individual level data being used to Table 1 gives the estimated probabilities of be-

estimate P(Z). ing employed (Y = 1) given marital status (X).

The estimates are obtained using the methods

(f) Adjusted ecological logistic regression. Here listed above. The ecological methods and adjusted

we use logistic regression based on aggregate correlation methods were implemented using the

326

SAS package and King's method was implemented mative bounds. However, the estimates are still

using EzI software, developed by King and col- worse than those obtained from the adusted logis-

leagues (King, 1997). The first row of the table tic regression method. When the variable "owner

give the "true" values of the probabilities as es- occupied" is used as a covariate, the estimates ob-

timated from census cross tabulations, available tained from King's method are effectively equal to

for these particular variables from the SAS data the true values, while those based on the adjusted

base, which were the same as the corresponding linear and logistic regression methods are not.

estimates obtained from the SAR. While these results are limited to two relation-

ships, they highlight the importance of using aux-

When adjustment variables are not included,

iliary variables in ecological analysis to obtain rea-

the ecological linear and ecological logistic esti-

sonable estimates. The choice of auxiliary vari-

mates are considerably different from the true val-

ables is important and methods of identifying ef-

ues. The inclusion of "owner occupied housing"

fective adjustment variables need to be used, as

as an adjustment variable leads to ecological lin-

suggested by Steel and Holt (1996). If appropriate

ear and logistic estimates that are much closer to

auxiliary variables are used, quite simple methods

the true values. Used as a single adjustment vari-

can perform almost as well as more sophisticated,

able "aged 60 and over" does not improve the es-

computer intensive ones. We expect that the sim-

timates. The estimates obtained by the adjusted

ple adjusted ecological methods can be extended

ecological linear method are fairly close to the true

to also include random effects within the sort of

values when several adjustment variables are used.

multilevel framework developed, for example, by

In particular, those combinations of adjustment

Goldstein (1995). Methods that solely use ran-

variables that include housing tenure. The ad-

dom effects to account for the variation in the re-

justed linear regression method generally works

lationship between groups will, in general, not be

better than the adjusted correlation method sug-

very successful. More information needs to be in-

gested by Steel and Holt (1996). In this exam-

corporated in order to get useful estimates. This

ple, King's method without covariates gives results

information may be in the form of individual level

similar to those for the ecological linear and lo-

data on auxiliary variables, group level covariates

gistic methods without covariates. When "owner

or the constraints exploited in King's method.

occupied" is used as a covariates, the estimated

probabilities are further from the true values than

those obtained using the adjusted linear regres- 4. References

sion method. However, when the covariate "rent-

ing from a local authority" is added the estimates Cleave, N., Brown, P.J. and C. D. Payne (1995)

from King's method are slightly better than those Evaluation of Methods for Ecological Inference.

based on the adjusted linear regression method. Journal of the Royal Statistical Society, A, 158,

Use of King's method with more than two covari- pp 55- 72

ates proved difficult in practice and no results for Goldstein, H. (1995). Multilevel Statistical Mod-

such cases are included. els, 2nd Edition, Edward Arnold, London.

Goodman, L.A. (1959). Some alternatives to Eco-

For the estimates of the proportion unemployed logical regression. American Journal of Sociologi-

by marital status given in Table 2, the unad- cal Review, 18, 663-664.

justed ecological linear and logistic methods pro- King, G. (1997). A Solution to the Ecological In-

vide poor estimates. The linear method leads to ference Problem :Reconstructing Individual Behav-

an out of range estimate. Including "owner occu- ior from Aggregate Data. Princeton Univ Press

pied" as an adjustment variable in these meth- Steel, D. and Holt, D. (1996). Analysing and Ad-

ods leads to some improvement. The adjusted justing Aggregation Effects: The Ecological Fal-

ecological linear regression method works reason- lacy Revisted. International Statistical Review,

ably well for combinations of adjustment vari- 64, pp 39-6O

ables that include housing tenure and age. King's Steel D., Holt D. and Tranmer M. (1996). Making

method without covariates works somewhat bet- unit level inferences from aggregated data, Survey

ter in this case than in the previous example, be- Methodology 22, 3-15

cause differences in the proportions unemployed

and married across the EDs lead to more infor-

327

T a b l e 1: E s t i m a t e d p r o b a b i l i t i e s : Y - e m p l o y e d ; X - m a r r i e d

SAS 'truth' .41 .50 SAS 'truth' .41 .50

covariate( s): covariate(s):

none .26 .67 none .26 .67

age2 .23 .71 age60+ .29 .65

age1, age2 .28 .65 age1, age2 .33 .59

oo .48 .42 oo .49 .41

oo, rla .47 .43 oo, rla .46 .44

oo, rla, age1,2 .45 .45 oo, rla, age1,2 .45 .46

SAS 'truth' .41 .50 SAS 'truth' .41 .50

covariate(s)" covariate(s):

none .27 .66 none .25 .69

age2 .22 .70 age60+

age1, age2 + .20 .73 age4559, age60+

oo .55 .35 oo .50 .39

oo, rla .51 .39 oo, rla .44 .46

oo, rla, agel,2 .47 .43 oo, rla, agel,2

Population: Residents aged 16 or more in households, Manchester LAD.

Y takes the value 1 for 'employed' 0 for 'not employed'.

X takes the value 1 for 'married' and 0 for 'not married'.

Pllo means P ( Y = 1 I X = 0); Plll means P ( Y = 1 I X = 1)

Adjustment variables: oo - owner occupied; rla - rented from local authority;

age1 = persons aged 45 - 59; age2 = persons aged 60 and over.

328

T a b l e 2: E s t i m a t e d p r o b a b i l i t i e s : Y - u n e m p l o y e d ; X - m a r r i e d

SAS 'truth' .14 .07 SAS 'truth' .14 .07

covariate(s)" covariate(s)"

none .24 -.06 none .17 .02

age2 .24 -.05 age60+ .17 .02

agel, age2 .22 -.03 agel, age2 .11 .09

oo .18 .01 oo .16 .04

oo, rla .19 .01 oo, rla .15 .04

oo, rla, agel,2 .16 .04 oo, rla, agel,2 .11 .09

SAS 'truth' .14 .07 SAS 'truth' .14 .07

covariate(s)- covariate(s)"

none .27 -.09 none .19 .00

age2 .27 -.09 age2

agel, age2 .28 -.09 agel, age2

oo .19 -.00 oo .14 .07

oo, rla .18 .01 oo, rla

oo, rla, agel,2 .16 .03 oo, rla, agel,2

Population: Residents aged 16 or more in households, Manchester LAD.

Y takes the value 1 for 'unemployed' 0 for 'not unemployed'.

X takes the value 1 for 'married' and 0 for 'not married'.

Pil0 means P(Y = l I X = 0); Pili means P(Y = l I X = 1)

Adjustment variables: oo = owner occupied; rla = rented from local authority;

agel = persons aged 45 - 59; age2 = persons aged 60 and over.

329

- Economics BooksUploaded bybabzz01
- RegressionUploaded byCynthia Smith
- Choosing TestUploaded bypmiklos2
- Crop Insurance PaperUploaded byShiv Raj Singh Rathore
- Financial AnalystUploaded byapi-76955184
- Impact of Facebook Usage on Students’ Involvement in Studies Doscussion- 15 DecUploaded bygarimaarora
- Research Article -HR OldUploaded bysehrish23
- RegressionUploaded byMansi Chugh
- PO Models With Complex Survey DataUploaded byakrause
- PASSS RQ2 1Simple Logistic Regression - One Categorical VariableUploaded byNguyen Bich Ngoc
- Oral Proficiency Standards & Teacher Candidate Standards Flan12030Uploaded byfaze2
- Manual WebUploaded bym
- CQEUploaded byrajaabid
- Logistic RegressionUploaded byMohamed Med
- 2.15 Exercise Time Travel Lite16Uploaded byaustronesian
- 15_J_2348Uploaded byAnasA.Qatanani
- ersh8320f07_18[1]Uploaded bynewhaven01
- Political Efficacy and Introductory Political Science Course: Findings from a Survey of a Large Public UniversityUploaded byMiguel Centellas
- Ch 4 the Research Process an Overview (1)Uploaded byClarenciaAdhemesTantri
- bigMNPUploaded byBryant Shi
- SPT-Based liquefaction triggering procedures.pdfUploaded bydnavarro
- P86Uploaded byAriel
- orioUploaded byChidochashe Chydide Chinogurei
- 2_GEOSTAT_BasicStatisticsUploaded byESHANU
- Idriss Boulanger SPT Liquefaction CGM-10-02Uploaded byJose Perez
- histogram kepemimpinanUploaded bySekar Kinasih
- Project 1 EducationUploaded bypoop2269
- Speech DelayedUploaded byNadia Sani amalia
- Which Obama VotersUploaded byLuigiScaini
- Economic Analysis of the Utilization of Disused Biomass From the Agricultural Activity in the Region of Thessaloniki-1Uploaded byAnonymous uxlBscW9

- 07 Simple Linear Regression Part2Uploaded byRama Dulce
- Predictive Analytics White PaperUploaded byKrs Here
- Frenzel+Et+Al+Emotions+and+Learning+2007Uploaded byKim SeongJun
- DNV-RP-J101 Use of Remote Sensing for Wind Energy Assessments April 2011Uploaded byTroy
- related paper2.pdfUploaded bywakeel zaman
- Applied Bayesian Econometrics for Central BankersUploaded byWendeline Kibwe
- DSME2040 Regression Students(1)Uploaded byAndy Nam
- Ejemplo de Analisis Multivariado Con R.Uploaded byzedanes
- A Modelling Approach Using Seedbank and SoilUploaded byGolub Marko
- Pediatric Short-bowel Syndrome- The Cost of Comprehensive CareUploaded byAffri Dian Adiyatna
- Applied Statistics HandbookUploaded byParamesh
- SPSS ManualUploaded byLumy Ungureanu
- Econometricscoursoutline MBAMS.docUploaded bysultanatkhan
- JP_chandon wansink laurent benefit congruency JM 00Uploaded byRajesh Kumar K
- A Smart Guide to Dummy VariablesUploaded byjtth
- Day1Uploaded byAnonymous BKnJpeyh
- Use SPC for Everyday Work Processes.pdfUploaded byRamon G Pacheco
- SPSS AnalysisUploaded byCortney Moss
- Module II Lec2Uploaded bydinesh111180
- MAS.M-1414. Cost concepts, Classification and Segregation.MC.docxUploaded bychowchow123
- AccUploaded byharikrishna k s
- mseschsyllUploaded bySAMEERA
- 2 (Exposure-Affecting Factors of Dairy Farmers Exposure to Inhalable Dust and Endotoxin)Uploaded byPrima Dhika Ayu
- Multiple Regression - HandoutUploaded byMarkus Santos
- Demand Estimation and ForecastingUploaded byAsnake Geremew
- Regression (Basic Concepts)Uploaded byNavneet Kumar
- Solutions to assigned problems in CH10.docUploaded byPJ Demonteverde Bantolo
- Parenting Internet StyleUploaded byRizka Sitti Muliya
- simple linear reg statUploaded bychrisadin
- Numerical Ch17Uploaded byVinodh Kumar