
Linking the DAMES & e-Stat Nodes

Paul Lambert, 26 Feb 2010, Bristol, e-Stat review meeting

DAMES is the ‘Data Management through e-Social Science’ research Node, www.dames.org.uk
1. Some background on DAMES

2. First thoughts on linking DAMES and e-Stat

3. Some proposals on usability / services

1. Data Management through e-Social Science

 DAMES – www.dames.org.uk
 ESRC Node funded 2008-2011
 Aim: Useful social science provisions
 Specialist data topics – occupations; educational qualifications; ethnicity; social care; health
 Mainstream packages and accessible resources
 Aim: To exploit/engage with existing DM resources
 In social science – e.g. ESDS, CESSDA
 In e-Science – e.g. OGSA-DAI; OMII

To us ‘Data management’ means…
 ‘the tasks associated with linking related data resources, with
coding and re-coding data in a consistent manner, and with
accessing related data resources and combining them within the
process of analysis’ […DAMES Node..]

 Usually performed by social scientists themselves
• Pre-analysis tasks (though often revised/updated)
• Inputs also from data providers
 Usually a substantial component of the work process
• But may not be explicitly rewarded (and sometimes penalised)
 Differentiated from archiving / controlling the data itself

Some components…
 Manipulating data
 Recoding categories / ‘operationalising’ variables
 Linking data
 Linking related data (e.g. longitudinal studies)
 Combining / enhancing data (e.g. linking micro- and macro-data)
 Secure access to data
 Linking data with different levels of access permission
 Detailed access to micro-data cf. access restrictions
 Harmonisation standards
 Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
 Recommendations on particular ‘variable constructions’
 Cleaning data
 ‘missing values’; implausible responses; extreme values
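The cleaning tasks named in the last item could be sketched in Python roughly as follows. This is only an illustration: the missing-value codes (-9, -7), the plausible range for age and the function names are assumptions made for the example, not DAMES conventions.

```python
from statistics import median

# Assumption: negative codes such as -9 and -7 mark missing or proxy answers.
MISSING_CODES = {-9, -7}


def clean_value(value, low, high, missing_codes=MISSING_CODES):
    """Return (cleaned_value, flag) for one survey response.

    Missing codes and values outside [low, high] are set to None,
    with a flag recording why, so the decision stays documented.
    """
    if value is None or value in missing_codes:
        return None, "missing"
    if value < low or value > high:
        return None, "implausible"
    return value, "ok"


def flag_extreme(values, k=3.0):
    """Flag values more than k median absolute deviations from the median."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [abs(v - centre) > k * mad for v in values]


# Hypothetical ages: one missing code and one implausible response.
ages = [34, -9, 250, 61]
cleaned = [clean_value(a, low=16, high=110) for a in ages]

# Hypothetical incomes with one extreme value.
extreme_flags = flag_extreme([10, 11, 12, 11, 10, 500])
```

Keeping the flag alongside the cleaned value means the rule that removed a response remains visible in the output, rather than being lost once the value is blanked.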

Example – recoding data

Count: highest educational qualification (rows) by educ4 (columns)

Highest educational qualification     -9  1 Degree  2 Diploma  3 Higher school  4 School level  Total
                                                                  or vocational        or below
-9 Missing or wild                   323         0          0                0               0    323
-7 Proxy respondent                  982         0          0                0               0    982
1 Higher Degree                        0       425          0                0               0    425
2 First Degree                         0      1597          0                0               0   1597
3 Teaching QF                          0         0        340                0               0    340
4 Other Higher QF                      0         0       3434                0               0   3434
5 Nursing QF                           0         0        161                0               0    161
6 GCE A Levels                         0         0          0             1811               0   1811
7 GCE O Levels or Equiv                0         0          0                0            2518   2518
8 Commercial QF, No O Levels           0         0          0              331               0    331
9 CSE Grade 2-5, Scot Grade 4-5        0         0          0                0             421    421
10 Apprenticeship                      0         0          0              257               0    257
11 Other QF                          102         0          0                0               0    102
12 No QF                               0         0          0                0            2787   2787
13 Still At School No QF             138         0          0                0               0    138
Total                               1545      2022       3935             2399            5726  15627
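The recode summarised in the cross-tabulation above can be sketched in Python. The mapping is read off the table itself; the names `EDUC4_MAP`, `recode_educ4` and `crosstab` are hypothetical and only illustrate the idea of keeping the recode in one documented place.

```python
from collections import Counter

# Source qualification code -> educ4 code, as implied by the cross-tabulation.
# Codes -9, -7, 11 and 13 all fall in the -9 column of educ4.
EDUC4_MAP = {
    -9: -9, -7: -9, 11: -9, 13: -9,
    1: 1, 2: 1,            # 1 Degree
    3: 2, 4: 2, 5: 2,      # 2 Diploma
    6: 3, 8: 3, 10: 3,     # 3 Higher school or vocational
    7: 4, 9: 4, 12: 4,     # 4 School level or below
}


def recode_educ4(code):
    """Return the educ4 category for one source code.

    Unknown codes raise KeyError, so an unexpected category is
    noticed instead of being silently dropped.
    """
    return EDUC4_MAP[code]


def crosstab(codes):
    """Count (source code, educ4 code) pairs, as a check on the recode."""
    return Counter((code, recode_educ4(code)) for code in codes)


# Hypothetical responses, used only to exercise the functions.
sample = [1, 2, 7, 7, 13, -7]
check = crosstab(sample)
```

Producing the cross-tabulation of old against new codes after every recode is a cheap way to confirm that each source category landed where it was meant to.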
Example – Linking data
Linking via ‘ojbsoc00’:
c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk

Matching files (‘deterministic’)
Complex data (complex research) is distributed across
different files. In surveys, use key linking variables for...
 One-to-one matching
SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
Stata: merge 1:1 pid using file2.dta

 One-to-many matching (‘table distribution’)
SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
Stata: merge m:1 pid using file2.dta

 Many-to-one matching (‘aggregation’)
SPSS: aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
Stata: collapse (mean) meaninc=income, by(pid)

 Many-to-Many matches

 Related cases matching
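For readers more used to a general-purpose language, the first three kinds of deterministic match can be sketched in plain Python. This is a minimal illustration, not a replacement for the SPSS/Stata commands above: the function names, the `hid` household key and the example records are all hypothetical, and the matches keep only the cases present in the first file.

```python
from collections import defaultdict
from statistics import mean


def match_one_to_one(file1, file2, key):
    """One-to-one match: each key value may occur once in each file."""
    lookup = {}
    for row in file2:
        if row[key] in lookup:
            raise ValueError("duplicate key in second file: %r" % row[key])
        lookup[row[key]] = row
    if len({row[key] for row in file1}) != len(file1):
        raise ValueError("duplicate key in first file")
    return [{**row, **lookup.get(row[key], {})} for row in file1]


def distribute_table(cases, table, key):
    """One-to-many match: spread each table row over all cases sharing its key."""
    lookup = {row[key]: row for row in table}
    return [{**case, **lookup.get(case[key], {})} for case in cases]


def aggregate_mean(cases, key, var, out):
    """Many-to-one match: one output row per key, holding the mean of var."""
    groups = defaultdict(list)
    for case in cases:
        groups[case[key]].append(case[var])
    return [{key: k, out: mean(v)} for k, v in groups.items()]


# Hypothetical person-level and household-level records.
persons = [
    {"pid": 1, "hid": 10, "income": 100.0},
    {"pid": 2, "hid": 10, "income": 300.0},
    {"pid": 3, "hid": 11, "income": 50.0},
]
households = [{"hid": 10, "region": "North"}, {"hid": 11, "region": "South"}]
jobs = [{"pid": 1, "soc": 2311}, {"pid": 3, "soc": 9233}]

linked = distribute_table(persons, households, key="hid")
with_jobs = match_one_to_one(persons, jobs, key="pid")
hh_income = aggregate_mean(persons, key="hid", var="income", out="meaninc")
```

The explicit duplicate-key checks in `match_one_to_one` reflect the main practical risk of deterministic matching: a key that was assumed to be unique but is not.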


A bit of focus…
 I tend to emphasise two data management
activities:

1) Variable constructions
o Coding and re-coding values
2) Linking datasets
o Internal and external linkages

..plus the centrality of keeping clear records of DM activities
Reproducible (for self)
Replicable (for all)
Paper trail for whole lifecycle
Cf. Dale 2006; Freese 2007

 In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax examples: www.longitudinal.stir.ac.uk

Principal DAMES services
(current status)

 GESDE specialist data environments (prototypes)
Occupations, educational qualifications, ethnicity
 Data curation tool (prototype)
 Data fusion tool (prototype)
 Secure data demonstrator for e-Health research (complete)
 Micro-simulation model for social care data (prototype)
 Training workshops and events (in progress)

GEMDE – Grid Enabled Specialist Data Environments

GEODE – Occupational data

Data curation tool
The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way

Data fusion tool

2. Linking DAMES and e-Stat
The high-level vision is to embed data management functionality and uptake within e-Stat modelling capabilities

- Using/adapting DAMES contributions
- DAMES services for data linking
- DAMES resources for recoding variables
- Making replication central to the data story

Data and variables
 DAMES does not in general provide routes to new/alternative microdata, but to relevant supplementary data (e.g. aggregate data)
 Anything on educational qualifications, occupations, ethnicity is of particular interest
 Generic tools for merging micro-data
 Generic tools for other variable processes

Data oriented review
 Applied research perspective
 Range of data resources
 Accessing and documenting data resource options

The implementation for e-Stat
 This is mostly a blank space…
 …and we’ve not hitherto used Python
 Data curation tool and GEODE/GEEDE use iRODS
 GEMDE uses a bespoke SQL database
 Data fusion tool uses R (and some Stata) scripts accessed via a Liferay portal
3. A pitch for specific e-Stat facilities

..harvest the best of data analysis packages from an applied data perspective

 Replication in ‘human readable syntax’
 Something like Stata’s ‘est store’ for multiple model comparisons
 Fluency in data oriented options
 Training resources in data
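As a rough indication of what an ‘est store’-like facility might look like in Python, the sketch below keeps named sets of estimates and prints them side by side, loosely in the spirit of Stata’s `estimates store` followed by `estimates table`. The class name `EstimateStore`, its methods and the coefficient values are all hypothetical; nothing here describes an existing e-Stat or DAMES interface.

```python
class EstimateStore:
    """Keep named model results so that several models can be compared."""

    def __init__(self):
        self._models = {}

    def store(self, name, coefs, n_obs):
        """Save a model's coefficients (term -> estimate) under a name."""
        self._models[name] = {"coefs": dict(coefs), "n_obs": n_obs}

    def table(self, names=None):
        """Return a text table with one column per stored model."""
        names = list(names or self._models)
        terms = []
        for name in names:
            for term in self._models[name]["coefs"]:
                if term not in terms:
                    terms.append(term)
        lines = ["{:<12}".format("") + "".join("{:>12}".format(n) for n in names)]
        for term in terms:
            cells = []
            for name in names:
                value = self._models[name]["coefs"].get(term)
                text = "" if value is None else "{:.3f}".format(value)
                cells.append("{:>12}".format(text))
            lines.append("{:<12}".format(term) + "".join(cells))
        counts = "".join("{:>12}".format(self._models[n]["n_obs"]) for n in names)
        lines.append("{:<12}".format("N") + counts)
        return "\n".join(lines)


# Made-up estimates, purely to show the layout of the comparison.
results = EstimateStore()
results.store("m1", {"age": 0.5}, n_obs=1000)
results.store("m2", {"age": 0.4, "female": -1.2}, n_obs=1000)
comparison = results.table()
print(comparison)
```

A term that is absent from a model is left blank in that model’s column, which is what makes nested model comparisons easy to read.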

 Est store demo here

Appendix items

Model 1:
[Diagram: ‘Data file specification’ (BHPS wave A individuals file; BHPS wave B individuals; wave C) alongside ‘Variable manipulation & analysis’ (Spouse SOC; Spouse CAMSIS; Gender; Age (yrs); Age bands; Current job RGSC; Analytical graphics)]

Text interface. DAMES most common commands:
-> usedataset{UKDA_5151}
-> usedatafile{individuals wave A}
-> matchdata{individuals wave A; individuals wave B; link variable=pid; format=wide}

Commands invoking other packages:
-> SPSS{match files file="aindresp.sav" /file="bindresp.sav" /by=pid}
-> SPSS{fre var=ajbrgsc}
-> Stata{recode ageb 16/30=1 31/50=2 *=.}
-> R{..}
-> Stata{do $path2\part1_analysis.do}

Commands are invoked manually or in response to manipulating graphs.

‘The significance of data management for social survey
research’
(see http://www.esds.ac.uk/news/eventdetail.asp?id=2151)

 The data manipulations described above are a major component of the social survey research workload
 Pre-release manipulations performed by distributors / archivists
• Coding measures into standard categories
• Dealing with missing records
 Post-release manipulations performed by researchers
• Re-coding measures into simple categories

 We do have existing tools, facilities and expert experience to help us…but we don’t make a good job of using them efficiently or consistently

 So the ‘significance’ of DM is about how much better research might be if we did things more effectively…

Some provocative examples for the UK…
 Social mobility is increasing, not decreasing!
− Popularity of controversial findings associated with Blanden et al (2004)
− Contradicted by wider ranging datasets and/or better measures of stratification position
− DM: researchers ought to be able to more easily access wider data and better variables

 Degrees, MScs and PhDs are getting easier!
− {or at least, more people are getting such qualifications}
− Correlates with measures of education are changing over time
− DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn’t, but should and could be, widespread

 ‘Black-Caribbeans’ are not disappearing!
− As the 1948-70 immigrant cohort ages, the ‘Black-Caribbean’ group is decreasingly prominent due to return migration and social integration of immigrant descendants
− Data collectors are under pressure to measure large groups only
− DM: It ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures

Comment – growing interest in data management..?
 Historically, references covering DM were few and far between
• Dale, A., Arber, S., & Procter, M. (1988). Doing Secondary Analysis. London: Unwin Hyman Ltd.
 Recently, there’s been a small burst of relevant references
• Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS Statistics 17.0. Chicago, Il.: SPSS Inc.
• Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
• Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.
• http://www.esds.ac.uk/support/onlineguides.asp
• http://www.longitudinal.stir.ac.uk/
 ..and growing interest re. ‘documentation for replication’
• Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
• Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2), 153-172.

E-Science and Data Management
E-Science isn’t essential to good DM, but it has the capacity to improve and support the conduct of DM…
1. Concern with standards setting in communication and enhancement of data
2. Linking distributed/heterogeneous/dynamic data: coordinating disparate resources; interrogating live resources
3. Contribution of metadata: tools/standards for variable harmonisation and standardisation
4. Linking data subject to different security levels
5. The workflow nature of many DM tasks
