
ANU-ABF WEBINAR SERIES ON TRANSNATIONAL SERIOUS AND ORGANIZED CRIME
Episode 5: Data Analytics
We acknowledge and celebrate the First Australians on whose traditional lands we meet, and pay our respect to the elders of the Ngunnawal people past and present.

Image: Namadgi National Park. Photograph by Adrian Brown, Ngunnawal man, Country ranger, ACT Parks and Conservation Service.
What is data analytics?
• Data analytics is the science of transforming, analysing, and presenting raw data to draw conclusions.
• Descriptive analytics helps answer questions about what happened.
• Diagnostic analytics helps answer questions about why things happened.
• Predictive analytics helps answer questions about what will happen in the future.
• Prescriptive analytics helps answer questions about what should be done.

BIG DATA

Big data and data analytics
• Data analytics is related to, but separate from, 'big data'.

• 'Big data refers to things that one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and government, and more.' – Mayer-Schönberger and Cukier (2013)

• 'As a result of more and more interaction with digital technologies by citizens, and the increasing capability of these technologies to provide digital trails, new sources of data have emerged and are increasingly available to official statisticians. Such sources include data from sensor networks and tracking devices e.g. satellites and mobile phones, behaviour metrics e.g. search engine queries, and on-line opinion e.g. social media commentaries. The collective term for such data sources is Big Data.' – Tam and Clarke (2015)

• 'Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.' – Wikipedia (25 October 2015)

Big data – What is new?
Einav and Levin (2014): 'data is now available faster, has greater coverage and scope, and includes new types of observations and measurements that previously were not available'.

Data is available in real time.
• The ability to capture and process data in real time is crucial for many business applications, and it is beginning to be used for research and policy.

Data is available at a larger scale.
• Because data sets were often small, statistical power used to be an important issue. Nowadays, data sets with tens of millions of distinct observations and huge numbers of covariates are quite common. In many cases, the large number of observations makes statistical power much less of a concern.

Data is available on novel types of variables.
• Much of the data now being recorded is on activities that were previously very difficult to observe.

Data come with less structure.
• In textbooks, data arrive in 'rectangular' form, with N observations and K variables, and with K typically a lot smaller than N. When data simply record a sequence of events, with no further structure, there are a huge number of ways to move from that recording to a standard rectangular format (one such way is sketched below).
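A minimal sketch of one such transformation, assuming pandas is available; the event log and its column names are hypothetical:

import pandas as pd

# Hypothetical event log: one row per event, no fixed structure
events = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2, 3],
    "event": ["login", "purchase", "login", "login", "purchase", "login"],
})

# One of many possible rectangular versions: one row per person,
# one column per event type, each cell a count of events
rectangular = pd.crosstab(events["person_id"], events["event"])
print(rectangular)

Counting events is only one choice; summing values, keeping first/last events, or time-windowing would each yield a different rectangular data set from the same log.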

Sources of big data – Administrative data
Administrative data is information that is recorded as part of the day-to-day operation of government services but is stored and can be used for other purposes.
• e.g. school attendance data, income support data, tax data, mobile phone calls

Strengths and opportunities
• Relatively low cost
• Relevant for a range of policy purposes (especially the effect of government services)
• Timely
• Focused on those of greatest policy interest

Weaknesses and challenges
• Privacy concerns
• Lacking information on subjective outcomes (usually)
• Difficult to access and often poorly documented
• Quality of data often difficult to gauge
• Data not available on those not accessing services
Confronting different administrative data sources – Homicide rates

Recorded crime data
• Crime statistics relating to victims of a selected range of personal and household offences that have been recorded by police.
• Data include a breakdown of the selected offences by victim characteristics (age and sex); the nature of the incident (weapon use and location); and the outcome of police investigations at 30 days.

Cause of death data
• The Causes of Death collection presents statistics and indicators for all deaths registered in Australia and includes data on sex, selected age groups and cause of death.
• The statistics are compiled from data made available to the ABS by the Registrar of Births, Deaths and Marriages in each state and territory and by the National Coronial Information System.
• Death through external causes can be further disaggregated to the category of death through assault, which is used as a measure of homicide/manslaughter.

Comparison (Mouzos 2003)
• Recorded crime: 1.9; cause of death: 1.7 (homicides per 100,000 population, averaged over 1993-2001)
Confronting different administrative data sources – Homicide rates
[Figure: homicide rates from the two administrative data sources.]
Sources of big data – Transactional data
Transactional data refers to data that is created when one individual (or group of individuals) interacts with another – usually, but not always, an individual buying a good or service from a business.
• e.g. scanner data, commercial sales, government sales

Strengths and opportunities
• Key outcome of interest (e.g. price)
• Transactions are usually geo- and time-coded
• Can be aggregated across industry, jurisdiction or product type
• Can be linked to demographic, geographic or other information

Weaknesses and challenges
• Commercially valuable and rarely made available to researchers
• Information on the buyer is often limited or biased
• Definitions may vary across businesses or through time
Sources of big data – Social media and search data
Information that is generated outside of commercial transactions through searches, social media interaction and other online activity.
• e.g. Google Trends, Facebook, Tinder

Strengths and opportunities
• Can reveal information that individuals are reluctant to share with interviewers
• Timely and inexpensive (in theory)
• Shows networks and other relationships

Weaknesses and challenges
• Information provided online may not match information available in the 'real world'
• Can be difficult to access
• Ethics and privacy concerns
Gamma, A., Schleifer, R., Weinmann, W., Buadze, A. and Liebrenz, M., 2016. Could Google Trends be used to predict methamphetamine-related crime? An analysis of search volume data in Switzerland, Germany, and Austria. PLoS ONE, 11(11), e0166566.
Sources of big data – Linked data
Information from two or more data sources can be combined (at the unit-record level) to provide more information than could be available from a single data source.
• e.g. census to census, census/survey to administrative data, multiple administrative datasets

Strengths and opportunities
• Information across a range of domains can be compared
• Can be used to analyse the factors associated with change through time
• Information on the same item collected in more than one context
• Can be used for 'hard to survey' populations

Weaknesses and challenges
• Ethical, legislative and privacy concerns
• Many datasets do not have a unique ID, requiring probabilistic/statistical linkage techniques (a minimal sketch follows this list)
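A minimal sketch of probabilistic-style matching in the absence of a unique ID, using only the Python standard library; the records, field weights and acceptance threshold are all hypothetical, and real linkage systems (e.g. those based on the Fellegi-Sunter model) are far more sophisticated:

from difflib import SequenceMatcher

# Hypothetical records from two datasets with no shared unique ID
dataset_a = [{"id": "A1", "name": "Jon Smith", "dob": "1980-03-02"}]
dataset_b = [{"id": "B7", "name": "John Smith", "dob": "1980-03-02"},
             {"id": "B9", "name": "Jane Smyth", "dob": "1975-11-30"}]

def match_score(rec_a, rec_b):
    # Weighted mix of name-string similarity and exact date-of-birth agreement
    name_sim = SequenceMatcher(None, rec_a["name"], rec_b["name"]).ratio()
    dob_match = 1.0 if rec_a["dob"] == rec_b["dob"] else 0.0
    return 0.6 * name_sim + 0.4 * dob_match

for a in dataset_a:
    for b in dataset_b:
        score = match_score(a, b)
        if score > 0.8:  # threshold would be tuned against clerical review
            print(a["id"], "<->", b["id"], round(score, 2))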

Sources of big data – Qualitative data at scale
Qualitative data (words, images, text) is usually analysed at small scale. However, new insights can be gained by analysing large amounts of data together (a document-term sketch follows this slide).
• e.g. court records, political texts, Instagram

Strengths and opportunities
• Provides a richer description (including subjective information) than quantitative data
• Can be combined with quantitative data from the same source
• Provides new ways to look at historical data pre-dating digitisation

Weaknesses and challenges
• Much easier to identify individuals and confidential information
• Convenience sampling can reduce external validity
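A minimal sketch of giving quantitative structure to text at scale, assuming scikit-learn is installed; the three snippets are invented stand-ins for a large corpus of court records:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical court-record snippets standing in for thousands of documents
documents = [
    "defendant sentenced to twelve months imprisonment",
    "charges withdrawn after committal hearing",
    "defendant sentenced to community service",
]

# Convert free text into a rectangular document-term matrix of word counts
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(dtm.toarray())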

Analysis of big data – Issues and solutions
Standard analysis of sample surveys is based on repeated samples of size n. For example:
• The standard error of the mean (SEM) is the standard deviation of the sample mean's estimate of a population mean, over repeated samples.

But when 'n = all', this assumption doesn't hold: the sample estimate is the population estimate.

Furthermore, 'with K > N it typically will be possible to perfectly explain the observed outcomes' within the data – Einav and Levin (2014).

The more important property is 'out of sample' prediction:
• Based on the data we have at a particular point in time/for a particular population, how accurate are the predictions for another point in time/a different population?

This can be tested using a 'training sample' to estimate the model parameters and a 'test sample' to evaluate performance, as in the sketch below.
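A minimal sketch of that training/test logic, assuming scikit-learn; the data are synthetic and the 70/30 split is an arbitrary illustrative choice:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # synthetic covariates
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.normal(size=1000)

# Hold out a test sample so performance is measured out of sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)    # estimate on training data only
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))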

ANALYSIS OF BIG DATA

Multiple Linear Regression

Simple: one independent variable. Multiple: two or more independent variables ($k \geq 2$).

$Y_i = \text{signal} + \text{noise} = \mu_i + \varepsilon_i$

$\mu_i = \beta_0 + \beta_1 X_{i1}$  (simple)

$\mu_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_k X_{ik}$  (multiple)
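A minimal fitting sketch, assuming statsmodels is installed; the sample size and true coefficients are synthetic, chosen only to show that OLS recovers them:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2))                      # two covariates X_i1, X_i2
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)  # mu_i + noise

X_design = sm.add_constant(X)                    # adds the intercept beta_0
results = sm.OLS(y, X_design).fit()
print(results.params)                            # approx. (2.0, 1.5, -0.8)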

Non-linear regression
• Used when the dependent variable is binary, categorical, censored or truncated.

Tobit
• Dependent variable is non-negative.

Binary logit/probit
• Dependent variable is {0, 1}; models probabilities (a minimal logit sketch follows this list).

Multinomial logit/probit
• Dependent variable is categorical, without any particular ordering.

Ordered probit
• Dependent variable is categorical, and the ordering is important.
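A minimal binary-logit sketch with statsmodels; the outcome is simulated from an assumed probability model, and the true coefficients (0.5, 1.2) are arbitrary:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))   # assumed true probability model
y = rng.binomial(1, p)                   # binary {0, 1} outcome

X = sm.add_constant(x)
logit_results = sm.Logit(y, X).fit(disp=False)
print(logit_results.params)              # approx. (0.5, 1.2)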

“A notable feature of econometrics is that it tends to focus more on models that explain than models that predict. This is particularly so if you compare econometrics to fields like data science or machine learning.”
— J. Stachurski, A Primer in Econometric Theory

Introduction to machine learning
• In general, a machine learning (ML) problem considers a sample of n data points and then tries to predict properties of unknown data.

• An ML problem falls into one of two categories:

1. Supervised learning – the use of labeled datasets (known x variables, also known as features) to classify and predict y (also known as outcomes).
(i) Regression – the outcome variable (y) is continuous.
(ii) Classification – the outcome variable consists of categorical data, for example binary, ordinal or mixed categories.

2. Unsupervised learning – the use of unlabeled data to learn patterns, for example when data do not have known x and y variables. Algorithms include clustering, density estimation and decomposition (PCA). A minimal clustering sketch follows.
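A minimal unsupervised sketch, assuming scikit-learn: k-means is asked to recover two groups from unlabeled two-dimensional points, where the group locations are synthetic:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Unlabeled data drawn from two hypothetical groups (no y variable)
data = np.vstack([rng.normal(0, 1, size=(100, 2)),
                  rng.normal(5, 1, size=(100, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.cluster_centers_)           # approx. (0, 0) and (5, 5)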

Bias-Variance Trade-off
• The bias-variance trade-off is a central problem in supervised learning. Ideally, the optimal model would accurately capture regularities in the data but also generalize well to unseen/new data.

• Bias – the bias of an estimator is its average error across different training sets. High-bias estimators typically produce simpler models that may fail to capture important relationships in the data (underfitting).

• Variance – the variance of an estimator indicates how sensitive it is to varying training sets. High-variance estimators may represent the training set well but fail to generalize to new data (overfitting). Both failure modes appear in the sketch below.
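A minimal numeric illustration on synthetic data: polynomial fits of increasing degree, with the degrees (1, 3, 12) chosen arbitrarily. Degree 1 underfits (high bias, poor fit everywhere); degree 12 overfits (high variance, low train error but worse test error):

import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 50)
y_train = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=50)   # training set
y_test = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=50)    # fresh noise draw

for degree in [1, 3, 12]:
    coefs = np.polyfit(x, y_train, degree)          # fit on training data only
    train_mse = np.mean((np.polyval(coefs, x) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")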

BIAS-VARIANCE TRADEOFF
[Figure 1: Income and age data]
[Figure 2: Train and test set]
[Figure 3: Fitting models on train data]
[Figure 4: Evaluating models on test data]
[Figure 5: Bias-variance tradeoff. Source: http://scott.fortmann-roe.com/docs/BiasVariance.html]
EXPERIMENTAL AND QUASI-EXPERIMENTAL METHODS

Examples of evaluation questions
We want to know causal effects, not just associations.
• Does a new drug cure more patients?
• Does a subsidised training program improve job prospects for those recently released from prison?
• Does prison prevent crime?
• Does requiring police to wear cameras reduce discrimination?
• Does Hawaii's Opportunity Probation with Enforcement (HOPE) program, which uses a 'swift and sure punishment' approach, discourage probation violations?
Experimental criminology
'Experimental criminology is a family of research methods that involves the controlled study of cause and effect. Research designs fall into two broad classes: quasi-experimental and experimental. A research (or evaluation) design is experimental if subjects are randomly assigned to treatment groups and to control (comparison) groups. A research (or evaluation) design is quasi-experimental if subjects are not randomly assigned to the treatment or control conditions but rather if statistical controls are used to study cause and effect.'
– Mazerolle and Bennett (2010)
Quantitative program evaluation – Overview and problems
Observable outcome indicator for an individual: Y_i.
A particular program (or treatment) aims to bring about an improvement in the outcome.
• T_i = 1 if in the program; T_i = 0 if not.

An individual is assumed to have a different outcome if they received treatment (Y_i^T) than if they did not (Y_i^C), with gains defined as G_i = Y_i^T − Y_i^C.
• Average treatment effect on the treated: TT = E(G | T = 1).

Missing data: Y_i^T if T_i = 0 and Y_i^C if T_i = 1.
• A person cannot be treated and not treated at the same time.

Only solutions (a simulation sketch follows below):
• Compare outcomes between groups of different people at the same time.
• Compare outcomes for the same group of people at different times ('before-and-after').
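A minimal simulation of why the naive between-group comparison fails, under the invented assumption that an unobserved trait drives both program entry and outcomes; the true effect is set to 1.0:

import numpy as np

rng = np.random.default_rng(5)
n = 100_000
ability = rng.normal(size=n)                    # unobserved confounder
treated = (ability + rng.normal(size=n)) > 0    # selection into the program
y_control = ability + rng.normal(size=n)        # Y_i^C
y_treated = y_control + 1.0                     # Y_i^T, true gain G_i = 1.0

# Only one potential outcome is ever observed for each person
observed = np.where(treated, y_treated, y_control)
naive = observed[treated].mean() - observed[~treated].mean()
print("naive difference in means:", round(naive, 2))   # well above 1.0

Because higher-ability people select into treatment, the simple comparison mixes the true effect with selection bias.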
The evaluation problem

           T_i = 0          T_i = 1
Y_i^T      X (missing)      ✔ (observed)
Y_i^C      ✔ (observed)     X (missing)
Evaluation of a Prisoner Pre-Release, Active Labour Market Program (I)

Total number of jobhelp prisoners (n=2001)
• Treatment group (n=793)
• Control group (n=1208)

Match rate to the administrative data:
• 85% for the treatment group
• 84% for the control group

Treatment status as recorded in the trial:
• Fully treated (n=214)
• Partially treated (n=102)
• Assigned to treatment but untreated (n=501)
• Treatment status unclear (n=391)

Most common reasons given for not receiving treatment:
• Insufficient time prior to release (13%)
• Prisoners were not interested (15%)
• Prisoner reported having employment on release (11%)
Evaluation of a Prisoner Pre-Release, Active Labour Market Program (II)

[Figure: bar chart of the probability of each outcome occurring in the post-trial period, treatment group vs control group (vertical axis 0 to 0.7). Outcome measures: a job-placement outcome; 4-week, 12-week and 26-week employment outcomes; exit from income support; and recidivism (return to prison) as measured in the income support data.]
Quasi-experimental methods
Natural experiments – where truly random events or institutional rules create variation in treatment status.
• If independent of potential outcomes, this variation can mimic random assignment in controlled experiments.

Propensity score matching/instrumental variables – matching on, or controlling for, the predicted probability of group membership (as a function of a third set of variables, Z_i).
• Requires confidence that unobserved factors do not jointly influence program placement and outcomes.

Discontinuity design – choose people just above and below a certain cut-off as the treatment and control groups.

Difference-in-differences estimate
• Tests for a difference in the change in outcomes through time (a two-by-two sketch follows this list).
• Only requires that the selection bias is time-invariant. That is, unobservable characteristics may affect the mean of the control/treatment group, but not the change through time.
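A minimal two-by-two difference-in-differences sketch; the before/after means are hypothetical numbers, not results from any study:

# Hypothetical mean outcomes (e.g. employment rates) before and after a program
treat_before, treat_after = 0.30, 0.45
control_before, control_after = 0.32, 0.38

# Differencing twice removes group differences that are constant through time
did = (treat_after - treat_before) - (control_after - control_before)
print("difference-in-differences estimate:", round(did, 2))   # 0.09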

Ethics with RCTs
Ethical concerns about randomization and denial of access
• Compare with the ethics of not knowing the effects of the policy
• Many programs already deny access to some groups
• A lottery can be the fairest allocation among the equally eligible
• Randomized encouragements

The Belmont principles establish ethical rules for research:
• Respect: participants should be informed of risks and given a choice about participation
• Beneficence: the risks of research should be carefully weighed against the benefits, and risks should be minimized
• Justice: the people (and the types of people) who take the risks should be those who benefit
QUALITY FRAMEWORKS FOR DATA ANALYTICS

The Total Survey Error Framework
'Total Survey Error refers to the accumulation of all errors that may arise in the design, collection, processing, and analysis of survey data. A survey error is defined as the deviation of a survey response from its underlying true value.'
– Groves and Lyberg (2010)
Quality frameworks for administrative and linked data
There is less consensus around a guiding framework, but a few approaches use administrative and/or integrated data.

Benzeval et al. (2020), 'Integrated Data: Research Potential and Data Quality'
• 'The processes which generate administrative data are very different than those of surveys, but they are not without measurement error, sample selection and other data errors. The process of linking or appending data can itself generate significant issues which are largely ignored. While integrated data can solve many problems, the solution is not costless and not always simple.'

Reid, Zabala and Holmberg (2017), 'Extending TSE to Administrative Data: A Quality Framework and Case Studies from Stats NZ'
• Phase 1 – how well a data set meets its original, intended purpose
• Phase 2 – problems that can arise when integrating data sets from different sources
• Phase 3 – estimation, design, and evaluation
‘Big data’ and research quality

APPLICATIONS TO POLICING

What is predictive policing? (Bachner 2013)
Three categories of analysis techniques that police departments use to predict crime:

• Analysis of space
• Identification of criminal hot spots, namely areas in which there is a greater likelihood of crime than in the surrounding areas (a grid-count sketch follows this list).

• Analysis of time and space
• Illustrates how the incidence and spatial distribution of crime change over time.

• Analysis of social networks
• Primarily used to detect persons of interest, as opposed to locations of interest.
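A minimal grid-count sketch of hot-spot detection, assuming numpy; the incident coordinates are uniform random placeholders for real geocoded data, and the 10 x 10 grid is an arbitrary choice:

import numpy as np

rng = np.random.default_rng(6)
# Hypothetical geocoded incident locations within a 10 km x 10 km area
incidents = rng.uniform(0, 10, size=(500, 2))

# Count incidents per 1 km x 1 km grid cell
counts, _, _ = np.histogram2d(incidents[:, 0], incidents[:, 1],
                              bins=10, range=[[0, 10], [0, 10]])
hot_cell = np.unravel_index(counts.argmax(), counts.shape)
print("busiest cell:", hot_cell, "with", int(counts.max()), "incidents")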

Limits of data analytics
• Privacy concerns
• Many data analytics applications require large datasets, often gathered without consent to use them for all analytics purposes.

• Fairness and biases
• Biases in the underlying datasets can be 'baked in' to algorithmic decision-making.
• Procedural fairness – participants in the criminal justice system also need to know how decisions are made.

• Data quality and documentation
• Not all data is as it seems.

• Data analytics skills and capacity
• Skills are needed both in undertaking data analytics and in interpreting and using the findings.
[Cartoon: Tom Gauld, New Scientist, 9 September 2017.]
References
• Bachner, J., 2013. Predictive policing: preventing crime with data and analytics. Washington, DC: IBM Center for the Business of Government.
• Benzeval, M., Bollinger, C., Burton, J., Couper, M.P., Crossley, T.F. and Jäckle, A., 2020. Integrated data: research potential and data quality. Understanding Society Working Paper Series, (2020-02).
• Einav, L. and Levin, J., 2014. The data revolution and economic analysis. Innovation Policy and the Economy, 14(1), pp.1-24.
• Gamma, A., Schleifer, R., Weinmann, W., Buadze, A. and Liebrenz, M., 2016. Could Google Trends be used to predict methamphetamine-related crime? An analysis of search volume data in Switzerland, Germany, and Austria. PLoS ONE, 11(11), e0166566.
• Groves, R.M. and Lyberg, L., 2010. Total survey error: past, present, and future. Public Opinion Quarterly, 74(5), pp.849-879.
• Mayer-Schönberger, V. and Cukier, K., 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
• Mazerolle, L. and Bennett, S., 2010. Experimental criminology. In B.M. Huebner (ed.), Oxford Bibliographies: Criminology. Oxford, United Kingdom: Oxford University Press. https://doi.org/10.1093/OBO/9780195396607-0085
• Mouzos, J., 2003. Australian homicide rates: a comparison of three data sources. Trends and Issues in Crime and Criminal Justice, (261), pp.1-6.
• Reid, G., Zabala, F. and Holmberg, A., 2017. Extending TSE to administrative data: a quality framework and case studies from Stats NZ. Journal of Official Statistics, 33(2).
• Stachurski, J., 2016. A Primer in Econometric Theory. MIT Press.
• Tam, S.M. and Clarke, F., 2015. Big data, official statistics and some initiatives by the Australian Bureau of Statistics. International Statistical Review, 83(3), pp.436-448.

