
MARCH 2023

Administrative data
and Data validation
Workshop II on Monitoring and Evaluation for the Supreme Court

M Maines
Research Associate, IPA
PRE–TEST


https://tinyurl.com/scvalid-pre
OUTLINE


1. What is and why use admin data

2. Information gathering

3. Implementation

4. Challenges using admin data


What is and Why use Administrative Data

Acquiring data

Information gathering
● Locate data source
● Understand data universe and content
● Understand the legal environment

Implementation
● Develop data flow
● Negotiate data use agreement
● Get IRB approval

Using Administrative Data: Challenges & Bias

SC Workshop II | Admin Data
1 | What is and why use administrative data

What is administrative data?


Information collected, used, and
stored primarily for administrative
(i.e., operational), rather than research
purposes.

Examples:
● Medical records
● Educational records
● Arrest records
● Banking records
● Personnel records
● SC: Monthly report of cases


DISCUSS:

What administrative data does your office usually need access to?

Why use administrative data?


The outcomes and metrics required for a study may already be
tracked by a government or organization
● Available retrospectively & prospectively
○ Baseline
○ Long-term follow-up
● Reduce logistical burden
● Often cheaper than surveys


Why use administrative data?


Using administrative data may address sources of bias & error
common to survey data collection
● Collected at time of occurrence
● Non-self-reported / passively collected
● Near census of relevant population


Both survey and admin data have challenges


[Table: each challenge is rated from "More Challenges" to "Fewer Challenges", separately for survey data and admin data]

Challenges compared:
● Recall and social desirability bias
● Attrition
● Costs
● Logistics
● Time
● Documentation
● Control over study population
● Control over data processing
● Treatment affects measurement
● Incentives to misreport

2 | Information gathering

Acquiring administrative data

Information gathering
● Locate data source
● Understand data universe and content
● Understand the legal environment

Implementation
● Develop data flow
● Negotiate data use agreement
● Get IRB approval

Over-arching tip: Get an early start


Plan to get administrative data before the study begins
● Allow the availability of data to inform the analysis plan
● Understand what information you need from participants to ensure access
Plan appropriately for data lags
● Some data are available only with a lag of one year or more
● Allow time for the data provider to extract and transfer the data
Take advantage of the “down time”
● Prepare analysis code to be ready once data arrive


Locate a data source


• Research partners, librarians
• Health
  ○ Vital statistics office
  ○ Health facilities (e.g., hospitals, clinics)
  ○ Insurance companies
• Statistics department
• Integrated data systems
• Finance
  ○ Banks, credit unions
  ○ Credit rating agencies
• Education
  ○ Schools
  ○ Department of education


DISCUSS:

What are the sources of administrative data needed by your office?


Understanding the data universe

Data Universe
• Which individuals / instances are captured in the data
• Reasons for inclusion or exclusion
• Example: Pantawid Pamilyang Pilipino Program

Flow diagram: Cases are filed in the lower court → Cases are logged in the MRC system by the lower court → Lower court submits the accomplished MRC form to SC

Understand the data content


Data Content
● Data dictionaries & documentation


Example from the Justice Sector


Monthly Reporting of Cases
● Who is in the data?
● What exactly does the data measure?
● What identifiers are available to link different sources of data?


Understand the legal environment


Why?
● Data could contain personally identifiable information (PII) and/or sensitive information
● Laws and regulations govern the responsibilities of data holders
● Access requires data use agreements, IRB approval, data security
● Emphasize trust



Game: Who’s that Pokemon?

Personally Identifiable Information (PII)


● Identifiable: very easy to identify the individual
● Partially de-identified: more difficult to identify, but still possible, especially with additional knowledge
● De-identified: very difficult, if not impossible, to identify

Note: Many combinations of variables can identify individuals
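
The note above can be illustrated with a minimal sketch that counts how many records are unique on a combination of seemingly harmless variables. The records and the column names (`birth_year`, `sex`, `city`) are hypothetical, made up for illustration only.

```python
from collections import Counter

# Hypothetical records with direct identifiers (name, ID numbers) already removed.
records = [
    {"birth_year": 1950, "sex": "F", "city": "Manila"},
    {"birth_year": 1950, "sex": "F", "city": "Manila"},
    {"birth_year": 1975, "sex": "M", "city": "Cebu"},
    {"birth_year": 1987, "sex": "F", "city": "Davao"},
]


def unique_combinations(rows, columns):
    """Return the value combinations of `columns` that occur exactly once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


# Combinations that single out one record may re-identify that individual.
risky = unique_combinations(records, ["birth_year", "sex", "city"])
print(risky)
```

Any combination returned here points to exactly one record, so someone with outside knowledge of those three values could single that person out even though no name is stored.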



3 | Implementation

Data flow
Identified Finder File: Study ID | Treatment Status | Name | DOB | SSN
Administrative Data File: Name | DOB | SSN | Outcome 1 | Outcome 2
→ De-identified Analysis File: Study ID | Treatment Status | Outcome 1 | Outcome 2
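
The data flow above can be sketched in a few lines: link the finder file to the administrative file on the identifiers, then keep only the study ID, treatment status, and outcomes. All records, field names, and the exact-match linking rule are assumptions for illustration, not the workshop's actual procedure.

```python
# Hypothetical finder file (held by the study team) and administrative file
# (held by the data provider). All values are made up.
finder_file = [
    {"study_id": 1, "treatment": 1, "name": "Marilyn Cruz",
     "dob": "1987-08-23", "ssn": "000-00-0001"},
    {"study_id": 2, "treatment": 0, "name": "Michael Santos",
     "dob": "1975-07-01", "ssn": "000-00-0002"},
]
admin_file = [
    {"name": "Marilyn Cruz", "dob": "1987-08-23", "ssn": "000-00-0001",
     "outcome_1": 3, "outcome_2": 0},
    {"name": "Michael Santos", "dob": "1975-07-01", "ssn": "000-00-0002",
     "outcome_1": 1, "outcome_2": 1},
]

ID_FIELDS = ("name", "dob", "ssn")  # identifiers used only for linking


def build_analysis_file(finder, admin):
    """Link on the identifiers, then keep only non-identifying fields."""
    admin_by_key = {tuple(row[f] for f in ID_FIELDS): row for row in admin}
    analysis = []
    for person in finder:
        match = admin_by_key.get(tuple(person[f] for f in ID_FIELDS))
        if match is None:
            continue  # unmatched records are simply dropped in this sketch
        analysis.append({
            "study_id": person["study_id"],
            "treatment": person["treatment"],
            "outcome_1": match["outcome_1"],
            "outcome_2": match["outcome_2"],
        })
    return analysis


analysis_file = build_analysis_file(finder_file, admin_file)
print(analysis_file)
```

The analysis file carries no name, DOB, or SSN, so analysts can work with outcomes without seeing who the records belong to.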


Data flow
Matching administrative data with program/evaluation data. Decide:

1. Which identifiers will be used for matching
2. What software and algorithm will be used
3. Which identifiers the research team will have access to
4. Which team will perform the match


Identifiers for matching


● Understand what identifiers the data agency collects
● Prefer numeric identifiers (e.g., date of birth, social security number) over
string variables (e.g., names), which are more prone to spelling variations
● Participants may not be willing to provide sensitive identifiers


Matching process
Exact
● Minor discrepancies are not well accounted for → false negatives

FINDER FILE                               ADMINISTRATIVE DATA
ID  NAME                    DOB           ID  NAME                    DOB           MATCH
1   Bhebhegurl dela Torres  5/1/1950      A   Bhebhegurl dela Torres  1/5/1950      ✗
2   Michael Santos          7/1/1975      B   Mike Santos             7/1/1975      ✗
3   Marilyn Cruz            8/23/1987     C   Marilyn Cruz            8/23/1987     ✓
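
A minimal sketch of exact matching on the records shown above. Names and dates are compared as literal strings, so any discrepancy produces a non-match. The function and variable names are illustrative assumptions.

```python
# Records from the example above: (ID, name, date of birth as written).
finder = [
    (1, "Bhebhegurl dela Torres", "5/1/1950"),
    (2, "Michael Santos", "7/1/1975"),
    (3, "Marilyn Cruz", "8/23/1987"),
]
admin = [
    ("A", "Bhebhegurl dela Torres", "1/5/1950"),
    ("B", "Mike Santos", "7/1/1975"),
    ("C", "Marilyn Cruz", "8/23/1987"),
]


def exact_match(finder_rows, admin_rows):
    """Map each finder ID to the admin ID with identical name and DOB, else None."""
    admin_index = {(name, dob): admin_id for admin_id, name, dob in admin_rows}
    return {fid: admin_index.get((name, dob)) for fid, name, dob in finder_rows}


matches = exact_match(finder, admin)
print(matches)  # {1: None, 2: None, 3: 'C'}
```

Records 1 and 2 plausibly refer to the same people in both files, yet the swapped date parts and the nickname leave them unmatched: these are the false negatives.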

Matching process
Non-Exact Deterministic

● A set of criteria is defined in advance that two records must meet to be
considered a match → risk of false positives

FINDER FILE                               ADMINISTRATIVE DATA
ID  NAME                    DOB           ID  NAME                    DOB           MATCH
1   Bhebhegurl dela Torres  5/1/1950      A   Bhebhegurl dela Torres  1/5/1950      ✓
2   Michael Santos          7/1/1975      B   Mike Santos             7/1/1975      ✓
3   Marilyn Cruz            8/23/1987     C   Marilyn Cruz            8/23/1987     ✓
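
A minimal sketch of non-exact deterministic matching, assuming two rules fixed in advance: names must agree after replacing known nicknames, and dates must agree up to a swap of the first two date parts. Both rules and the nickname table are hypothetical choices for illustration.

```python
# Hypothetical nickname table; a real project would need a much larger one.
NICKNAMES = {"mike": "michael"}


def name_key(name):
    """Lowercase the name and replace known nicknames with the full form."""
    return " ".join(NICKNAMES.get(word, word) for word in name.lower().split())


def dob_key(dob):
    """Treat the first two date parts as interchangeable (day/month swaps)."""
    first, second, year = (int(part) for part in dob.split("/"))
    return (year,) + tuple(sorted((first, second)))


def is_match(record_a, record_b):
    """Both pre-defined criteria must hold for the pair to count as a match."""
    return (name_key(record_a[0]) == name_key(record_b[0])
            and dob_key(record_a[1]) == dob_key(record_b[1]))


pairs = [
    (("Bhebhegurl dela Torres", "5/1/1950"), ("Bhebhegurl dela Torres", "1/5/1950")),
    (("Michael Santos", "7/1/1975"), ("Mike Santos", "7/1/1975")),
    (("Marilyn Cruz", "8/23/1987"), ("Marilyn Cruz", "8/23/1987")),
]
results = [is_match(a, b) for a, b in pairs]
print(results)  # [True, True, True]
```

Relaxing the rules recovers the pairs that exact matching missed, but two different people with the same name born on 5 January and 1 May of the same year would now also be linked, which is how false positives arise.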

Matching process
Probabilistic

● Estimates the probability that two records belong to the same individual. Techniques include:
○ calculating the similarity of names based on phonetic computation
○ accounting for nicknames or rarity of names

ID  NAME                     DOB           ID  NAME                     DOB           MATCH
1   Bhebhegurl de la Torres  5/10/1985     1   Bhebhegurl Dela Torres   10/5/1985
2   Michael Santos           7/11/1975     2   Michael Santos           11/7/1975     ✗
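
A minimal sketch of the probabilistic idea, assuming a simple likelihood-ratio score: agreement on a rare name is stronger evidence than agreement on a common one. Every number below (name frequencies, agreement probabilities, the prior) is a made-up assumption, and `difflib` string similarity stands in for the phonetic comparison mentioned above.

```python
from difflib import SequenceMatcher

# Hypothetical share of the population carrying each name (spaces removed).
NAME_FREQUENCY = {"michaelsantos": 0.01}
DEFAULT_FREQUENCY = 0.00001  # assumed for names not in the table
M_NAME = 0.95                # assumed P(names agree | same person)


def normalize(name):
    return "".join(name.lower().split())


def name_likelihood_ratio(a, b):
    """P(observation | same person) / P(observation | different people)."""
    u = NAME_FREQUENCY.get(normalize(a), DEFAULT_FREQUENCY)
    similar = SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= 0.9
    return M_NAME / u if similar else (1 - M_NAME) / (1 - u)


def dob_likelihood_ratio(a, b):
    first_a, second_a, year_a = a.split("/")
    first_b, second_b, year_b = b.split("/")
    if (first_a, second_a, year_a) == (first_b, second_b, year_b):
        m, u = 0.90, 0.00005   # identical dates
    elif year_a == year_b and (first_a, second_a) == (second_b, first_b):
        m, u = 0.08, 0.00005   # first two date parts swapped
    else:
        m, u = 0.02, 0.9999    # dates disagree
    return m / u


def match_probability(rec_a, rec_b, prior=1e-6):
    """Combine the field-level ratios with a prior into a match probability."""
    ratio = (name_likelihood_ratio(rec_a["name"], rec_b["name"])
             * dob_likelihood_ratio(rec_a["dob"], rec_b["dob"]))
    odds = prior / (1 - prior) * ratio
    return odds / (1 + odds)


rare = match_probability({"name": "Bhebhegurl de la Torres", "dob": "5/10/1985"},
                         {"name": "Bhebhegurl Dela Torres", "dob": "10/5/1985"})
common = match_probability({"name": "Michael Santos", "dob": "7/11/1975"},
                           {"name": "Michael Santos", "dob": "11/7/1975"})
print(rare > common)
```

Under these assumed numbers the rare-name pair scores well above the common-name pair even though both have swapped date parts. Where to set the cut-off between match and non-match remains a judgment call.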



Data use agreements (DUAs)


Documents the terms under which a data provider shares data with a researcher’s
home institution for use by the researcher

Aliases and similar terms:
• Data Sharing Agreement (DSA)
• Memorandum of Understanding (MOU)
• Non-Disclosure Agreement (NDA)

Elements
❑ Project description
❑ Users and analysts
❑ Data security procedures
❑ Data to be shared
❑ Timeframe
❑ Data destruction
❑ Publication review
❑ Data publication



DUAs should be signed by an official institutional representative rather than an
individual PI or staff member

• Ensures proper legal review
• Avoids over-promising
• Prevents unapproved institutional liability
• Ensures no individual liability (to the extent possible)



Templates

Best to start from one of the contracting parties’ preferred templates


• Most universities, as well as some implementing partners and data
providers, have their own templates
• http://nda.mit.edu/



Act as an intermediary to help smooth the negotiation

• Understand the legal context – “needs” vs. “wants”
• Establish a good relationship with the data provider
• Emphasize the importance of the research and relevance to the data provider’s mission (if applicable)



4 | Challenges using administrative data

Usage of admin data


Two questions to ask:

1. Why were these data collected?
   a. Were there incentives or opportunities to misreport information?
   b. Can you choose variables that are less susceptible to bias?

2. Which individuals are included in the data and which are excluded, and why?
   a. What steps have to occur before appearing in the data?
   b. Does the intervention affect reporting of outcomes?


Reporting Bias
To address:

● Identify the context in which the data were collected

○ Were there incentives or opportunities to misreport information?

○ Consider direction in which estimates are likely to be biased

● Choose variables that are less susceptible to bias / easier to verify


Differential coverage and selection bias


Differential probability of appearing in administrative records between
treatment and control

To address:

● Identify the data universe

● Identify how the intervention may affect reporting of outcomes

● Conduct a baseline survey with identifiers for linking
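
One way to probe for differential coverage is to compare the share of study participants found in the administrative records across treatment and control. A minimal sketch, assuming a linked roster with hypothetical fields `treatment` and `found_in_admin`:

```python
# Hypothetical roster after linking: one row per study participant.
roster = [
    {"study_id": 1, "treatment": 1, "found_in_admin": True},
    {"study_id": 2, "treatment": 1, "found_in_admin": True},
    {"study_id": 3, "treatment": 1, "found_in_admin": True},
    {"study_id": 4, "treatment": 1, "found_in_admin": True},
    {"study_id": 5, "treatment": 0, "found_in_admin": True},
    {"study_id": 6, "treatment": 0, "found_in_admin": False},
    {"study_id": 7, "treatment": 0, "found_in_admin": True},
    {"study_id": 8, "treatment": 0, "found_in_admin": False},
]


def match_rate_by_arm(rows):
    """Share of participants found in the admin data, per treatment arm."""
    totals, matched = {}, {}
    for row in rows:
        arm = row["treatment"]
        totals[arm] = totals.get(arm, 0) + 1
        matched[arm] = matched.get(arm, 0) + (1 if row["found_in_admin"] else 0)
    return {arm: matched[arm] / totals[arm] for arm in totals}


rates = match_rate_by_arm(roster)
print(rates)  # {1: 1.0, 0: 0.5}
```

A large gap between arms, as in this made-up roster, suggests the intervention itself may change who appears in the records, so outcome comparisons based only on matched records could be biased.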



Data Validation 101

What is data validation?


The process of checking that the data have the characteristics we assume.

● Called “high frequency checks” for survey data
● Called “data validation” for admin data


● Correct: reported data is the observed value (true value + measurement error)
● Consistent: each row captures the same kind of data
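
A minimal sketch of a consistency check: every row of a delivery should parse under the same assumed type and format for each column. The column names, formats, and sample rows are hypothetical and do not reflect the actual MRC layout.

```python
import csv
import io
from datetime import datetime

# Hypothetical delivery; column names and formats are assumptions.
RAW = """court_id,month,cases_filed
RTC-001,2023-01,12
RTC-002,2023-02,seven
RTC-003,01/2023,9
"""

# What we assume each column looks like: a parser that fails on bad values.
EXPECTED = {
    "court_id": str,
    "month": lambda value: datetime.strptime(value, "%Y-%m"),
    "cases_filed": int,
}


def find_inconsistent_rows(text):
    """Return (file line number, column, value) for values that fail to parse."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column, parser in EXPECTED.items():
            value = row.get(column)
            try:
                parser(value)
            except (TypeError, ValueError):
                problems.append((line_no, column, value))
    return problems


problems = find_inconsistent_rows(RAW)
print(problems)  # [(3, 'cases_filed', 'seven'), (4, 'month', '01/2023')]
```

The check only flags rows and leaves the values untouched, which keeps validation separate from cleaning.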


• NOT data cleaning (changing the characteristics of the data we receive)
• NOT ensuring that we have high quality data
• NOT a test of statistical validity



DISCUSS:

What types of errors do you check your data for?



Sources of error / problems


● Poorly designed forms / systems
○ If values are not restricted to a certain type or format at data entry,
inconsistent entries can result.
○ If the form is not specific enough, it can leave ambiguity about how to answer
questions.
● Data entry
○ Mistakes (e.g., misspellings, keystroke errors)
○ Misunderstanding (e.g., annual income vs monthly income)
○ Misreporting/bias (discussed earlier)
● Data extraction
● Data pre-preparation



SAMPLE WORKFLOW

When do we validate data?


Example: Survey data

Pipeline: Respondent → Device → Server → Researchers → Publication
Steps: Data collection → Data storage → Data management & cleaning → Data analysis

Checks along the pipeline, in order:
– Questionnaire design; Enumerator probing
– Soft checks; Response constraints
– High frequency checks (soft logic checks, outlier detection etc.); Duplication checks; Survey data review

Source: Georgina Evans, Gary King, Adam D. Smith, and Abhradeep Thakurta. Working Paper. “Differentially Private Survey Research”. Copy at https://j.mp/3jAYXo3



Example: Administrative data
Researcher often does not control the data generation process

Data generating process
Pipeline: Respondent → Device → Storage → Researchers → Publication
Steps: Data collection → Data storage → Data management & cleaning → Data analysis

Checks:
– Data quality checks
– Data review

Source: Georgina Evans, Gary King, Adam D. Smith, and Abhradeep Thakurta. Working Paper. “Differentially Private Survey Research”. Copy at https://j.mp/3jAYXo3


When should we validate data?


1) As soon as we get the data
• Immediately, so we can go back to the provider.
• TIP: Automate to reduce processing time as much as possible.
2) On the most recent delivery of data received
• TIP: Validation scripts should use the formats (e.g., variable names) as we receive them
3) Before cleaning, so we can be sure errors come from the data received, not from
our modifications.
• Fits naturally into the process of converting the data to .dta, .RData, or an RDBMS
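
The three points above can be sketched as one automated script that runs on each delivery as received, uses the variable names exactly as delivered, and never modifies the raw file. The file name, column names, and the choice of checks are assumptions for illustration.

```python
import csv
import hashlib
import tempfile
from pathlib import Path

# Assumed variable names, exactly as they arrive from the provider.
EXPECTED_COLUMNS = ["court_id", "month", "cases_filed"]


def file_checksum(path):
    """Fingerprint of the raw file, to confirm validation never alters it."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def validate_delivery(path):
    """Read-only checks on one delivery; returns a small report dictionary."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        rows = list(reader)
    keys = [(row.get("court_id"), row.get("month")) for row in rows]
    return {
        "checksum": file_checksum(path),
        "n_rows": len(rows),
        "missing_columns": [c for c in EXPECTED_COLUMNS if c not in columns],
        "unexpected_columns": [c for c in columns if c not in EXPECTED_COLUMNS],
        "duplicate_keys": sorted({k for k in keys if keys.count(k) > 1}),
    }


# Demo on a made-up delivery written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    delivery = Path(folder) / "delivery_01.csv"
    delivery.write_text(
        "court_id,month,cases_filed\n"
        "RTC-001,2023-01,12\n"
        "RTC-001,2023-01,12\n",
        encoding="utf-8",
    )
    before = file_checksum(delivery)
    report = validate_delivery(delivery)
    unchanged = file_checksum(delivery) == before

print(report["n_rows"], report["duplicate_keys"], unchanged)
```

Because the script only reads the file and reports, any problem it finds can be traced to the delivery itself and raised with the provider before cleaning starts.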



Sample workflow: Two distinct processes for cleaning and validating data

[Diagram: Data delivery #n (.csv) → Validation outputs. Separately, in the cleaning workflow: Raw data #n (.dta) → Anonymized raw data → Clean data]


Recap: what is data validation


The process of checking that the data have the characteristics we assume.
● Correct: reported data is the observed value (true value + measurement error)
● Consistent: each row captures the same kind of data

Framework to identify and triage errors in collected data
POST–TEST


https://tinyurl.com/scvalid-post
Thank you very much! (Maraming salamat po!)
