ABF Webinar - Episode 5
SERIES ON
TRANSNATIONAL
SERIOUS AND
ORGANIZED CRIME
Episode 5: Data Analytics
We acknowledge and
celebrate the First Australians
on whose traditional lands we
3 COLLEGE OF ARTS & SOCIAL SCIENCES | ANU-ABF WEBINAR SERIES ON TRANSNATIONAL SERIOUS & ORGANIZED CRIME 25 NOVEMBER 2021
BIG DATA
Big data and data analytics
• Data analytics is related to, but separate from ‘big data’
• ‘Big data refers to things that one can do at a large scale that cannot be done at a smaller one, to extract new
insights or create new forms of value, in ways that change markets, organizations, the relationship between
citizens and government, and more.’ Mayer-Schönberger and Cukier (2013)
• ‘As a result of more and more interaction with digital technologies by citizens, and the increasing capability of
these technologies to provide digital trails, new sources of data have emerged and are increasingly available to
official statisticians. Such sources include data from sensor networks and tracking devices e.g. satellites and
mobiles phones, behaviour metrics e.g. search engine queries, and on-line opinion e.g. social media
commentaries. The collective term for such data sources is Big Data.’ Tam and Clarke (2015)
• ‘Big data is a broad term for data sets so large or complex that traditional data processing applications are
inadequate.’ Wikipedia (October 25th, 2015)
Big data – What is new?
Einav and Levin (2013) – ‘data is now available faster, has greater coverage and scope, and includes new types of observations
and measurements that previously were not available’
Sources of big data – Administrative data
Administrative data is information that is recorded as part of the day-to-day operation of government services but is stored
and can be used for other purposes
• e.g. School attendance data, income support data, tax data, mobile phone calls
Confronting different administrative data sources – Homicide rates
Recorded crime data
• Crime statistics relating to victims of a selected range of personal and household offences that have been recorded by police.
• Data includes a breakdown of the selected offences by victim characteristics (age and sex); the nature of the incident (weapon use and location);
and outcome of police investigations at 30 days.
Sources of big data – Transactional data
Transactional data refers to data that is created when one individual (or group of individuals) interacts with another.
Usually, but not always, an individual buying a good or service from a business.
• e.g. Scanner data, commercial sales, government sales
Sources of big data – Social media and search data
Information that is generated outside of commercial transactions through searches, social media interaction and other online
activity
• e.g. Google Trends, Facebook, Tinder
Gamma, A., Schleifer, R., Weinmann, W., Buadze, A. and Liebrenz, M., 2016.
Could Google Trends be used to predict methamphetamine-related crime? An
analysis of search volume data in Switzerland, Germany, and Austria. PloS one,
11(11), p.e0166566.
Sources of big data – Linked data
Information from two or more data sources can be combined (at the unit record level) to provide more information
than could be available from a single data source
• e.g. Census to census, census/survey to administrative data, multiple administrative datasets
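A minimal sketch of unit-record linkage, assuming both sources share a person identifier. The identifiers, field names and values below are hypothetical and chosen only for illustration; real linkage often has to rely on probabilistic matching when no common key exists.

```python
# Hypothetical unit records keyed by a shared person identifier.
census = {
    "p1": {"age": 34, "region": "ACT"},
    "p2": {"age": 51, "region": "NSW"},
    "p3": {"age": 27, "region": "VIC"},
}
income_support = {
    "p1": {"weeks_on_support": 12},
    "p3": {"weeks_on_support": 40},
    "p4": {"weeks_on_support": 5},
}

# Keep only people found in both sources and merge their fields.
linked = {
    pid: {**census[pid], **income_support[pid]}
    for pid in census.keys() & income_support.keys()
}

# Records that appear in only one source cannot be linked.
unlinked = sorted(census.keys() ^ income_support.keys())

for pid in sorted(linked):
    print(pid, linked[pid])
print("unlinked:", unlinked)
```

The unlinked records matter as much as the linked ones: if they differ systematically from the rest, the linked file is a selected sample.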
Sources of big data – Qualitative data at scale
Qualitative data (words, images, text) is usually analysed at small scale. However, new insights can be gained by
analysing large amounts of data together
• e.g. court records, political texts, Instagram
Analysis of big data – Issues and solutions
Standard analysis of sample surveys is based on repeated samples of size ‘n’. For example:
• The standard error of the mean (SEM) is the standard deviation of the sample-mean's estimate of a population mean, over repeated samples.
But, when ‘n=all’, this assumption doesn’t hold. The sample estimate is the population estimate
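The point about repeated samples can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is hypothetical (1,000 simulated values) and samples are drawn without replacement.

```python
import random
import statistics

random.seed(0)

# Hypothetical finite population of 1,000 values (illustrative only).
population = [random.gauss(50, 10) for _ in range(1000)]
sigma = statistics.pstdev(population)

def sd_of_sample_means(n, draws=500):
    """Standard deviation of the sample mean over repeated samples of
    size n, drawn without replacement from the population."""
    means = [statistics.fmean(random.sample(population, n))
             for _ in range(draws)]
    return statistics.pstdev(means)

sem_n25 = sd_of_sample_means(25)
sem_all = sd_of_sample_means(len(population))

print(f"sigma / sqrt(25)       = {sigma / 25 ** 0.5:.3f}")
print(f"simulated SEM, n = 25  = {sem_n25:.3f}")
print(f"simulated SEM, n = all = {sem_all:.3f}")
```

With n = 25 the sample mean varies from draw to draw; with n = all every "sample" is the whole population, so there is no sampling variation left to measure.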
Furthermore – ‘with K > N it typically will be possible to perfectly explain the observed outcomes’ within the data –
Einav and Levin (2013)
Such overfitting can be tested using a “training sample” to estimate the model parameters and a “test sample” to evaluate
performance
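A minimal sketch of the training/test idea, assuming hypothetical data generated as y = 2x + noise. The flexible model here is a polynomial with as many parameters as training observations (the boundary case K = N), which is enough to explain the training outcomes perfectly.

```python
import random

random.seed(1)

def make_data(n):
    """Hypothetical data: y = 2x + noise, with x spread over [0, 1]."""
    xs = [random.random() for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 0.3)) for x in xs]

def interpolate(train, x):
    """Lagrange polynomial through every training point: as many
    parameters as observations, so the in-sample fit is perfect."""
    total = 0.0
    for i, (xi, yi) in enumerate(train):
        weight = 1.0
        for j, (xj, _) in enumerate(train):
            if i != j:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

def mse(model, data):
    """Mean squared prediction error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train, test = make_data(8), make_data(200)

def flexible(x):
    return interpolate(train, x)

train_mse = mse(flexible, train)
test_mse = mse(flexible, test)
print("training MSE:", train_mse)
print("test MSE:    ", test_mse)
```

The training error is zero by construction, so only the test sample says anything about how well the model predicts new data.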
ANALYSIS OF BIG DATA
Multiple Linear Regression
$Y_i = \text{signal} + \text{noise} = \mu_i + \varepsilon_i$
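A sketch of the signal-plus-noise idea, assuming a hypothetical linear signal with two regressors and normally distributed noise. The coefficients are estimated by ordinary least squares through the normal equations; the helper names `solve` and `ols` are illustrative, not taken from the slides.

```python
import random

random.seed(2)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(rows, ys):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical process: mu_i = 1 + 2*x1 - 3*x2 and eps_i ~ N(0, 1).
true_beta = [1.0, 2.0, -3.0]
rows, ys = [], []
for _ in range(2000):
    row = [1.0, random.gauss(0, 1), random.gauss(0, 1)]
    signal = sum(b * x for b, x in zip(true_beta, row))
    rows.append(row)
    ys.append(signal + random.gauss(0, 1))

beta_hat = ols(rows, ys)
print([round(b, 2) for b in beta_hat])
```

The estimates recover the signal only approximately, because each observation also carries noise.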
Non-Linear regression
• Dependent variable is binary, categorical, censored or truncated
Tobit
• Dependent variable is non-negative.
Binary logit/probit
• Dependent variable is {0,1}. Model probabilities
Multinomial logit/probit
• Dependent variable is categorical, without any particular ordering
Ordered probit
• Dependent variable is categorical, with ordering important
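As one illustration of the models listed above, a binary logit can be sketched as follows. The data are hypothetical, and plain gradient ascent on the log-likelihood is used only to keep the example dependency-free; statistical packages typically use Newton-type methods instead.

```python
import math
import random

random.seed(3)

def sigmoid(z):
    """Logistic function mapping a linear index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: P(y = 1 | x) = sigmoid(-0.5 + 2x).
xs = [random.gauss(0, 1) for _ in range(2000)]
ys = [1 if random.random() < sigmoid(-0.5 + 2.0 * x) else 0 for x in xs]

# Maximise the average log-likelihood by gradient ascent (step size 1).
b0, b1 = 0.0, 0.0
for _ in range(500):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        resid = y - sigmoid(b0 + b1 * x)
        g0 += resid
        g1 += resid * x
    b0 += g0 / len(xs)
    b1 += g1 / len(xs)

print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
print(f"P(y = 1 | x = 1) = {sigmoid(b0 + b1):.2f}")
```

The fitted model returns probabilities between 0 and 1, which a linear regression on a {0,1} outcome does not guarantee.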
“A notable feature of econometrics is that it tends to focus more on
models that explain than models that predict. This is particularly so if you
compare econometrics to fields like data science or machine learning.”
- J. Stachurski, A Primer in Econometric Theory
Introduction to machine learning
• In general, a machine learning (ML) problem considers a sample of n data points and then tries to
predict properties of unknown data.
2. Unsupervised learning – Use of unlabeled data to learn patterns. For example, when data does
not have known x and y variables. Algorithms include clustering, density estimation and
decomposition (PCA).
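A minimal sketch of clustering as an unsupervised method, assuming hypothetical one-dimensional data drawn from two groups. The function name `kmeans_1d` and the choice of two clusters are illustrative assumptions.

```python
import random
import statistics

random.seed(4)

# Hypothetical unlabeled data: two groups, with no label saying which is which.
data = ([random.gauss(0, 1) for _ in range(100)]
        + [random.gauss(10, 1) for _ in range(100)])

def kmeans_1d(values, iterations=20):
    """Two-cluster k-means on a list of numbers."""
    centres = [min(values), max(values)]  # simple deterministic start
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            groups[nearest].append(v)
        # Move each centre to the mean of the points assigned to it.
        centres = [statistics.fmean(g) for g in groups]
    return centres

centres = kmeans_1d(data)
print([round(c, 2) for c in centres])
```

No outcome variable is used anywhere: the algorithm recovers the two groups from the structure of the data alone.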
Bias-Variance Trade-off
• The bias-variance trade-off is a central problem in supervised learning. Ideally, the optimal model would accurately
capture regularities in the data but also generalize well to unseen/new data.
• Bias – The bias of an estimator is its average error for different training sets. High bias estimators typically
produce simpler models that may fail to capture important relationships in data (underfitting).
• Variance – The variance of an estimator indicates how sensitive it is to varying training sets. High variance
estimators may be able to represent the training set well but may fail to generalize to new data (overfitting).
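The two definitions above can be made concrete with a simulation over many training sets. This is a sketch under stated assumptions: the true function, noise level and test point are hypothetical, and k-nearest-neighbour regression stands in for "flexible" (k = 1) and "rigid" (k = n) models.

```python
import math
import random
import statistics

random.seed(5)

def true_f(x):
    return math.sin(2 * math.pi * x)

def training_set(n=30):
    """Hypothetical noisy sample from y = sin(2*pi*x) + noise."""
    xs = [random.random() for _ in range(n)]
    return [(x, true_f(x) + random.gauss(0, 0.3)) for x in xs]

def knn_predict(train, x0, k):
    """Average the y-values of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return statistics.fmean([y for _, y in nearest])

def bias_sq_and_variance(k, x0=0.25, repeats=300):
    """Squared bias and variance of the prediction at x0 across
    many independently drawn training sets."""
    preds = [knn_predict(training_set(), x0, k) for _ in range(repeats)]
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    return bias_sq, statistics.pvariance(preds)

flexible = bias_sq_and_variance(k=1)   # follows the nearest point
rigid = bias_sq_and_variance(k=30)     # always predicts the sample mean

print(f"k = 1:  bias^2 = {flexible[0]:.3f}, variance = {flexible[1]:.3f}")
print(f"k = 30: bias^2 = {rigid[0]:.3f}, variance = {rigid[1]:.3f}")
```

The rigid model misses the shape of the function (high bias, low variance), while the flexible model tracks it on average but changes with every training set (low bias, high variance).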
Figure 5: Bias-variance tradeoff
Source: http://scott.fortmann-roe.com/docs/BiasVariance.html
EXPERIMENTAL AND QUASI-EXPERIMENTAL METHODS
Examples of evaluation questions
We want to know causal effects, not just association.
• Does a new drug cure more patients?
• Does a subsidised training program improve job prospects for those recently released from prison?
• Does prison prevent crime?
• Does requiring police to wear cameras reduce discrimination?
• Does Hawaii’s Opportunity Probation with Enforcement (HOPE) program, which uses a “swift and
sure punishment” approach, discourage probation violations?
Experimental criminology
Quantitative program evaluation – Overview and problems
Observable outcome indicator for an individual (Yi)
Particular program (or treatment) aims to bring about improvement in outcome
• (Ti = 1 if in program, Ti = 0 if not)
Individual is assumed to have a different outcome if they received treatment (YiT) than if they did not (YiC), with gains
defined as Gi = YiT - YiC
• Average treatment effect on the treated TT = E(G|T=1)
The two outcomes are never both observed for the same individual, so the only solutions are:
• Compare outcomes between groups of different people at the same time.
• Compare outcomes for the same group of people at different times (“before-and-after”).
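A sketch of why comparing groups of different people can mislead, assuming a hypothetical program in which every participant gains exactly 2.0. The `ability` variable is an illustrative stand-in for anything the evaluator cannot observe.

```python
import random
import statistics

random.seed(6)

def simulate(assign, n=20000):
    """Return (true TT, naive treated-minus-untreated difference).
    Hypothetical set-up: every individual gains exactly 2.0."""
    gains, treated_y, control_y = [], [], []
    for _ in range(n):
        ability = random.gauss(0, 1)        # unobserved by the evaluator
        y_c = ability + random.gauss(0, 1)  # outcome without the program
        y_t = y_c + 2.0                     # outcome with the program
        if assign(ability):
            treated_y.append(y_t)           # only YiT is observed
            gains.append(y_t - y_c)
        else:
            control_y.append(y_c)           # only YiC is observed
    naive = statistics.fmean(treated_y) - statistics.fmean(control_y)
    return statistics.fmean(gains), naive

# Self-selection: people with higher ability are more likely to enrol.
tt_sel, naive_sel = simulate(lambda a: a + random.gauss(0, 1) > 0)
# Randomisation: a coin flip decides who enrols.
tt_rct, naive_rct = simulate(lambda a: random.random() < 0.5)

print(f"self-selected: TT = {tt_sel:.2f}, naive difference = {naive_sel:.2f}")
print(f"randomised:    TT = {tt_rct:.2f}, naive difference = {naive_rct:.2f}")
```

Under self-selection the naive comparison mixes the program effect with pre-existing differences between the groups; under random assignment the two groups are comparable and the difference in means is close to TT.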
The evaluation problem
        Ti=0    Ti=1
YiT      X       ✔
YiC      ✔       X
Evaluation of a Prisoner Pre-Release, Active Labour Market Program (I)
[Figure: bar chart, total number of jobhelp prisoners (n=2001). Vertical axis from 0 to 0.5; horizontal axis: outcome measure. Categories: a job-placement; a 4-week employment outcome; a 12-week employment outcome; a 26-week employment outcome; exit from income support; recidivism (return to prison) as measured in the income support data.]
Quasi-experimental methods
Natural experiments – Where truly random events or institutional rules create variation in treatment status.
• If independent of potential outcomes, this variation can mimic random assignment in controlled experiments.
Propensity score matching/instrumental variables – Matching or controlling for the predicted probability of group
membership (as a function of a third set of variables, Zi)
• Requires confidence that no unobserved factors jointly influence program placement and outcomes
Discontinuity design – Choose people above and below a certain cut-off as treatment and control groups
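A sketch of a discontinuity design, assuming a hypothetical risk score with a cut-off of 50 and a true program effect of 3.0. A separate line is fitted on each side of the cut-off within a bandwidth, and the two fitted values are compared at the cut-off itself; the bandwidth of 10 is an arbitrary illustrative choice.

```python
import random
import statistics

random.seed(7)

CUTOFF, TRUE_EFFECT, BANDWIDTH = 50.0, 3.0, 10.0

# Hypothetical data: a score decides treatment, and the outcome also
# rises smoothly with the score.
people = []
for _ in range(20000):
    score = random.uniform(0, 100)
    treated = score >= CUTOFF
    outcome = 0.05 * score + TRUE_EFFECT * treated + random.gauss(0, 1)
    people.append((score, outcome))

def fit_line(points):
    """Intercept and slope of a simple least-squares line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in points)
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def value_at_cutoff(points):
    """Fitted outcome at the cut-off from a line fitted to these points."""
    intercept, slope = fit_line(points)
    return intercept + slope * CUTOFF

below = [(s, y) for s, y in people if CUTOFF - BANDWIDTH <= s < CUTOFF]
above = [(s, y) for s, y in people if CUTOFF <= s <= CUTOFF + BANDWIDTH]

rd_estimate = value_at_cutoff(above) - value_at_cutoff(below)
print(f"estimated jump at the cut-off: {rd_estimate:.2f}")
```

People just above and just below the cut-off are assumed to be otherwise similar, so the jump in outcomes at the cut-off is attributed to the program.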
Ethics with RCTs
Ethical concerns about randomization and denial of access
• Compare with the ethics of not knowing the effects of the policy
• Many programs already deny access to some groups
• Lottery as the fairest allocation among the equally eligible
• Randomized encouragements
QUALITY FRAMEWORKS FOR DATA ANALYTICS
The Total Survey Error Framework
Quality frameworks for administrative and linked data
There is less consensus around a guiding framework, but a few approaches address administrative and/or integrated data:
Benzeval et al. (2020) ‘Integrated Data: Research Potential and Data Quality’
• ‘The processes which generate administrative data are very different than those of surveys, but they are not without measurement error, sample
selection and other data errors. The process of linking or appending data can itself generate significant issues which are largely ignored. While
integrated data can solve many problems, the solution is not costless and not always simple’
Reid, Zabala, and Holmberg (2017) ‘Extending TSE to Administrative Data: A Quality Framework and Case Studies
from Stats NZ’
• Phase 1 - how well a data set meets its original, intended purpose
• Phase 2 - problems that can arise when integrating data sets from different sources
• Phase 3 - estimation, design, and evaluation
‘Big data’ and research quality
APPLICATIONS TO POLICING
What is predictive policing? (Bachner 2013)
Three categories of analysis techniques that police
departments use to predict crime:
• Analysis of space
• Identification of criminal hot spots, namely areas in which
there is a greater likelihood of crime than in the surrounding
areas.
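A minimal sketch of hot spot identification, assuming hypothetical incident coordinates on a 10 km by 10 km area divided into 1 km grid cells. The rule used here (a cell with at least five times the average count) is an illustrative assumption, not a method taken from Bachner (2013).

```python
import random
from collections import Counter

random.seed(8)

# Hypothetical incident coordinates (km): background incidents everywhere
# plus a concentration near (7.5, 2.5).
incidents = [(random.uniform(0, 10), random.uniform(0, 10))
             for _ in range(300)]
incidents += [(random.gauss(7.5, 0.3), random.gauss(2.5, 0.3))
              for _ in range(100)]

def cell(x, y, size=1.0):
    """Grid cell (column, row) containing the point (x, y)."""
    return (int(x // size), int(y // size))

counts = Counter(cell(x, y) for x, y in incidents)
mean_count = len(incidents) / 100  # the area holds 100 one-km cells

# Flag cells with far more incidents than the average cell.
hot_spots = sorted(c for c, k in counts.items() if k >= 5 * mean_count)

print("busiest cell:", counts.most_common(1)[0])
print("hot spots:", hot_spots)
```

Counts like these reflect recorded incidents only, so the result depends on where crime is reported and recorded as well as where it occurs.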
Limits of data analytics
• Privacy concerns
• Many data analytics applications require large datasets, often collected without consent for every analytic purpose
Tom Gauld, New Scientist 9 September 2017
References
• Bachner, J., 2013. Predictive policing: preventing crime with data and analytics. Washington, DC: IBM Center for the Business of Government.
• Benzeval, M., Bollinger, C., Burton, J., Couper, M.P., Crossley, T.F. and Jäckle, A., 2020. Integrated data: research potential and data quality.
Understanding Society Working Paper Series, (2020-02).
• Einav, L. and Levin, J., 2014. The data revolution and economic analysis. Innovation Policy and the Economy, 14(1), pp.1-24.
• Gamma, A., Schleifer, R., Weinmann, W., Buadze, A. and Liebrenz, M., 2016. Could Google Trends be used to predict methamphetamine-related
crime? An analysis of search volume data in Switzerland, Germany, and Austria. PloS one, 11(11), p.e0166566.
• Groves, R.M. and Lyberg, L., 2010. Total survey error: Past, present, and future. Public opinion quarterly, 74(5), pp.849-879.
• Mayer-Schönberger, V. and Cukier, K., 2013. Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.
• Mazerolle, L. and Bennett, S., 2010. Experimental criminology. Oxford Bibliographies: Criminology. Edited by B.M. Huebner.
Oxford, United Kingdom: Oxford University Press, pp.1-1. https://doi.org/10.1093/OBO/9780195396607-0085
• Mouzos, J., 2003. Australian homicide rates: a comparison of three data sources. Trends and Issues in Crime and Criminal Justice, (261), pp.1-6.
• Reid, G., Zabala, F. and Holmberg, A., 2017. Extending TSE to Administrative Data: A Quality Framework and Case Studies from Stats NZ. Journal
of Official Statistics (JOS), 33(2).
• Stachurski, J., 2016. A Primer in Econometric Theory. MIT Press.
• Tam, S.M. and Clarke, F., 2015. Big data, official statistics and some initiatives by the Australian Bureau of Statistics. International Statistical
Review, 83(3), pp.436-448.