
Bayesian Network Modeling of Offender Behavior for Criminal Profiling

Master's Defense
Kelli Crews Baumgartner
Laboratory for Intelligent Systems & Control, Mechanical Engineering, Duke University
10/18/2007

Outline
• Background to Criminal Profiling
• Motivation
• Bayesian Networks (BNs)
• Methodology
• Results
• Future Work
• Acknowledgments

The Case of the Mad Bomber
The profile:
• Male
• Eastern European
• Late 40's to early 50's
• Lived in Connecticut with a maiden aunt or sister
• Paranoia
• Wearing a double-breasted suit, buttoned

George Metesky, when apprehended:
• Slavic descent
• 54 years old
• Lived in Connecticut with two maiden sisters
• Acute paranoia
• Wearing a double-breasted suit, buttoned

Background
• Goals of criminal profiling:
– Concentrate criminal investigations
– Interrogation strategies

• Collaboration with law enforcement


Previous Research
• Early 1980s: the FBI developed the organized/disorganized dichotomy
• 1985-1994: Dr. David Canter expanded the FBI model: interpersonal coherence, time and place, criminal characteristics, criminal behavior, and forensic awareness
• 1999-present: G. Salfati and D. Canter explore the expressive/instrumental dichotomy

Expressive vs. Instrumental
• Uses multidimensional scaling (MDS)
– Non-metric multidimensional scaling procedure
– Plots association coefficients for each crime scene behavior
– Similarly themed actions co-occur in the same region of the plot, and vice versa

• Data: single offender-single victim homicides recorded by the British police (82 cases and 247 cases)

Results of MDS Research
• 62% of the cases exhibited a majority of their crime scene characteristics in a single theme
• 74% of all offenders could also be classified as either expressive or instrumental
• 55% of the cases exhibited the same theme in both their crime scene actions and their offender background characteristics
• High-frequency crime scene behaviors can be eliminated

Motivation
• Develop a network model linking the profile of the offender to his/her decisions and behaviors at the crime scene
• Determine correlations between input and output variables based on data from real cases
– Input variables: Crime Scene Analysis (CSA), victimology assessment, and medical examination
– Output variables (offender profile): sex of offender, prior convictions, relationship with victim
– Training data: solved cases (inputs and outputs known)
• Apply the software to produce the offender profile for unsolved cases, given the input variables

Criminal Profiling Software Development
• Use expert knowledge to initialize BN variables
• Train the network model (NM) with solved cases
• Test the NM with validation cases by inputting the crime scene variables and comparing the predicted offender variables to the observed offender variables

[Flow diagram: Data + Initial Model → Trained Model; Inputs → Trained Model → Outputs]

Belief Networks
• Bayesian Networks, B = (S, Θ)
– Probabilistic network

[Example DAG: X1 → X2, X1 → X3]

Notation: x_i,j denotes variable X_i in state j

P(X1):
P(x1,1) = 0.4, P(x1,2) = 0.6

P(X2 | X1):
              X1 = x1,1   X1 = x1,2
P(x2,1 | X1)     0.9         0.5
P(x2,2 | X1)     0.1         0.5

P(X3 | X1):
              X1 = x1,1   X1 = x1,2
P(x3,1 | X1)     0.8         0.3
P(x3,2 | X1)     0.2         0.7
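The factorization this network encodes, P(X1, X2, X3) = P(X1) P(X2 | X1) P(X3 | X1), can be sketched directly from the slide's CPT values. This is a minimal illustration only, not the thesis software (which was written in Matlab):

```python
# Joint distribution of the 3-node example BN: X1 -> X2, X1 -> X3.
# CPT values are taken from the slide; states are 0-indexed.
P_X1 = [0.4, 0.6]                      # P(x1,1), P(x1,2)
P_X2_given_X1 = [[0.9, 0.1],           # row: X1 state, column: X2 state
                 [0.5, 0.5]]
P_X3_given_X1 = [[0.8, 0.2],
                 [0.3, 0.7]]

def joint(x1, x2, x3):
    """P(X1=x1, X2=x2, X3=x3) via the BN factorization."""
    return P_X1[x1] * P_X2_given_X1[x1][x2] * P_X3_given_X1[x1][x3]

total = sum(joint(a, b, c) for a in range(2) for b in range(2) for c in range(2))
print(round(total, 10))  # 1.0 -- the joint distribution is properly normalized
```

Because X2 and X3 share only the common parent X1, eight numbers in the three small tables suffice to define the full joint distribution.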

Example of Inference
• X2 and X3 are observed
• X1 is unknown, to be inferred

[Same DAG: X1 → X2 with P(X2 | X1), X1 → X3 with P(X3 | X1)]

• Possible states:
X1 = {x1,1, ..., x1,r1}
X2 = {x2,1, ..., x2,r2}
X3 = {x3,1, ..., x3,r3}

Bayes Theorem
• Bayes Rule* to infer X1 when X2=x2,r2 and X3=x3,r3:
P ( X 1 | x 2 , r 2 , x 3, r 3 ) = P ( x 2 , r 2 , x 3, r 3 | X 1 ) P ( X 1 ) P( x 2,r 2 , x s ,r 3 )
3

• Marginalization of the observed variables:

P( x 2,r 2 , x3,r 3 ) = ∑ P( X 1 = x1,k )∏ P( x j ,rj | x k )
k =1 j =2

r1

• The posterior probability distribution:

∑ P( X
h =1
10/18/2007

r1

1

= x1,h | x 2,r 2 , x3,r 3 ) = 1
F. Jensen, Bayesian Networks and Decision Graphs, 2001 12
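Bayes' rule above can be exercised on the example CPTs from the earlier slide. The observed states (x2,1 and x3,1) are chosen for illustration; this is a sketch, not the thesis code:

```python
# Infer P(X1 | X2=x2,1, X3=x3,1) by Bayes' rule for the X1 -> X2, X1 -> X3 network.
P_X1 = [0.4, 0.6]
P_X2_given_X1 = [[0.9, 0.1], [0.5, 0.5]]
P_X3_given_X1 = [[0.8, 0.2], [0.3, 0.7]]

x2, x3 = 0, 0  # observed states (0-indexed)

# Numerator of Bayes' rule, one term per state of X1:
unnorm = [P_X2_given_X1[k][x2] * P_X3_given_X1[k][x3] * P_X1[k] for k in range(2)]
# Marginalization over X1 gives the evidence P(x2, x3):
evidence = sum(unnorm)
posterior = [u / evidence for u in unnorm]

print(posterior)  # sums to 1, as the last equation on the slide requires
```

Observing both children shifts belief toward the X1 state that explains them best, which is exactly how evidence propagates in the full criminal-profiling network.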

BN Training
• Set of training data, T, used to learn the BN, B = (S, Θ)
• Use a structural learning algorithm to find S_optimal
– Search method needed to decrease the search space
– Scoring metric
• Maximum Likelihood Estimation (MLE) algorithm to learn the CPTs, P(Θ_optimal | S_optimal, T)

Joint Probability Scoring Metric
• Relative measure of the level of compatibility of S^h given the training data T: P(S^h | T)
• P(S^h | T) ∝ P(S^h, T):

P(S^h, T) = P(S^h \mid T)\, P(T) \;\Rightarrow\; \frac{P(S_i^h \mid T)}{P(S_j^h \mid T)} = \frac{P(S_i^h, T)/P(T)}{P(S_j^h, T)/P(T)} = \frac{P(S_i^h, T)}{P(S_j^h, T)}

\therefore\; P(S_i^h \mid T) < P(S_j^h \mid T) \iff P(S_i^h, T) < P(S_j^h, T)

Scoring Metric Assumptions
1. All variables are discrete
2. All structures are equally likely
3. All variables are known, with no missing values
4. All cases in T occur independently given a BN model
5. No prior knowledge of the numerical parameters to assign to B^h with structure S^h before observing T

B^h = (S^h, \Theta^h) \in \mathbf{B}

P(S^h, T) = \int_{\Theta^h} f(T \mid S^h, \Theta^h)\, f(\Theta^h \mid S^h)\, P(S^h)\, d\Theta^h \;\Rightarrow\;

P(S^h, T) = P(S^h) \cdot \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!

G.F. Cooper et al., Machine Learning, 1992.

Variable Definition for Scoring Metric
P(S^h, T) = P(S^h) \cdot \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!

n: number of model variables
q_i: number of unique instantiations of π_i, where π_i = pa(X_i)
r_i: number of possible states of X_i
N_ijk: number of cases in T in which X_i = x_{i,k} and π_i is instantiated as w_ij, for k = 1, ..., r_i
N_ij: N_{ij} = \sum_{k=1}^{r_i} N_{ijk}

G.F. Cooper et al., Machine Learning, 1992.
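The Cooper-Herskovits expression above can be transcribed almost directly, using log-factorials (via `math.lgamma`) for numerical stability. The count array below is hypothetical, purely to exercise the formula:

```python
import math

def family_score(counts):
    """Log of prod_j [(r_i - 1)! / (N_ij + r_i - 1)!] * prod_k N_ijk!
    counts[j][k] = N_ijk: cases with X_i = x_{i,k} under the j-th parent
    instantiation w_ij; r_i = len(counts[0]), q_i = len(counts)."""
    r = len(counts[0])
    log_p = 0.0
    for row in counts:  # one row per parent instantiation j
        N_ij = sum(row)
        # lgamma(r) = ln((r-1)!), lgamma(N_ij + r) = ln((N_ij + r - 1)!)
        log_p += math.lgamma(r) - math.lgamma(N_ij + r)
        log_p += sum(math.lgamma(n + 1) for n in row)  # prod_k N_ijk!
    return log_p

# Hypothetical counts for one binary node with a binary parent (q_i = 2, r_i = 2):
counts = [[8, 2],   # parent state 1: N_i11 = 8, N_i12 = 2
          [3, 7]]   # parent state 2: N_i21 = 3, N_i22 = 7
print(family_score(counts))
```

Working in log space matters in practice: the factorials overflow quickly once N_ij grows past a few hundred training cases.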

K2 Algorithm
• Maximizes the scoring metric
• Requires a node ordering: X1, X2, X3
• Limits the number of parents per node

P(S_{optimal}, T) = \max_h P(S^h, T) \;\Rightarrow\;

\max_{B^h} \left\{ P(S^h) \cdot \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \right\} \;\xrightarrow{K2}\; \max_{B^h} \left\{ g = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \right\}

G.F. Cooper et al., Machine Learning, 1992.
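The K2 search itself is a greedy loop: for each node, keep adding the preceding parent that most improves the per-node metric g until no candidate helps or the parent limit is hit. This is a schematic sketch under the slide's assumptions; the scoring function is passed in as an argument (here a hypothetical lookup table) rather than computed from real case data:

```python
def k2_parents(i, order, score, max_parents):
    """Greedy K2 parent selection for node order[i].
    score(node, parents) returns the per-node metric g (higher is better)."""
    node = order[i]
    parents = []
    best = score(node, parents)
    while len(parents) < max_parents:
        # Candidates must precede the node in the ordering (K2 requirement).
        candidates = [p for p in order[:i] if p not in parents]
        if not candidates:
            break
        gains = [(score(node, parents + [c]), c) for c in candidates]
        new_best, best_c = max(gains)
        if new_best <= best:  # no candidate improves the score: stop
            break
        best, parents = new_best, parents + [best_c]
    return parents

# Toy score table preferring parent set {0} for node 2 (hypothetical numbers):
toy = {(): 1.0, (0,): 3.0, (1,): 2.0, (0, 1): 2.5}
print(k2_parents(2, [0, 1, 2], lambda n, ps: toy[tuple(sorted(ps))], 2))  # [0]
```

The node ordering is what keeps the search space manageable: each node can only draw parents from its predecessors, so the graph is acyclic by construction.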

K2’ Algorithm
[Diagram 1: DAG with CPTs P(X4), P(X1 | X4), P(X2 | X1, X4), P(X3 | X1), i.e. edges X4 → X1, X4 → X2, X1 → X2, X1 → X3]

[Diagram 2: the same nodes X4, X1, X2, X3 with the connections between input nodes inhibited]

K2’ Algorithm

• Inhibit nodal connections between input nodes
– d-separation: X3 ⊥ X2, X4 iff X1 is known
– X2 and X3 are not affected by the X4 → X1 relationship if the parents are known
• Everything else is the same as K2

The Learned Structure
• Initial structure is an empty graph
• Final structure is learned from T

[Diagram: output nodes O1, O2, ..., Om above input nodes I1, I2, ..., Ik; links among the input nodes are allowed only for K2]

Parameter Learning
• Maximum Likelihood Estimator (MLE) used for parameter learning because there are no missing values (the EM algorithm would be needed otherwise)
• MLE determines the parameters that maximize the probability (likelihood) of T

f(T \mid \Theta^h, S^h) = \prod_{i=1}^{t} f(C_i \mid \Theta^h, S^h) = L(\Theta^h \mid T, S^h)

\Lambda = \ln L = \sum_{i=1}^{t} \ln f(C_i \mid \Theta^h, S^h) \;\Rightarrow\; \frac{\partial \Lambda}{\partial \Theta^h} = 0
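For complete data, solving ∂Λ/∂Θ = 0 reduces to relative-frequency counts: each CPT entry is N_ijk / N_ij. A minimal sketch for one child node with one parent; the variable names and the four cases are illustrative, not real thesis data:

```python
from collections import Counter

def mle_cpt(cases, child, parent, child_states, parent_states):
    """MLE of P(child | parent) from complete cases: each entry is a
    normalized count N_ijk / N_ij, the closed-form MLE solution.
    Assumes every parent state appears at least once in the data."""
    pair = Counter((c[parent], c[child]) for c in cases)
    marg = Counter(c[parent] for c in cases)
    return {(p, ch): pair[(p, ch)] / marg[p]
            for p in parent_states for ch in child_states}

# Illustrative solved cases: G = gender output, DS = disorder input.
cases = [{"DS": 4, "G": 1}, {"DS": 4, "G": 1}, {"DS": 4, "G": 2}, {"DS": 1, "G": 1}]
cpt = mle_cpt(cases, "G", "DS", [1, 2], [1, 4])
print(cpt[(4, 1)])  # 2/3: two of the three DS=4 cases have G=1
```

With missing values the counts become expectations and the closed form disappears, which is why the slide reserves EM for that situation.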

Modeling Example
• Train a network model for a simple problem with two inputs and two outputs:
– Inputs, CSA:
(1) Place of aggression characteristics
(2) Amount of disorder provoked from fight/struggle
– Outputs, criminal profile:
(1) Gender of the offender
(2) Presence of sexual relations between victim and offender
• Train the model with simulated cases using Matlab
• Use evidence of the inputs to infer the outputs in new cases

Example: Variable Definition
Inputs from the Crime Scene Analysis (CSA) checklist (evidence):
• Characteristics of the Place of Aggression, node PA:
(1) Not crowded external place (remote)
(2) Semi-crowded external place (semi-remote)
(3) Crowded external place
(4) Inner place (room, building, office)
• DiSorder Provoked by fight/struggle, node DS:
(1) In room/area where corpse is found
(2) On all the area/room/study/office/store
(3) In the vicinities of the area/room/study/office/store
(4) No disorder provoked

Outputs from the network model, criminal profile:
• Gender of offender, node G: (1) Male, (2) Female
• Presence of Sexual Relations between victim and offender, node SR: (1) Yes, (2) No

Network Model
One solved case: PA = 2 (semi-remote), DS = 4 (no disorder), G = 1 (male), SR = 2 (no)

Network variables: PA (Place of Aggression), DS (DiSorder provoked by fight), G (Gender of offender), SR (Sexual Relations between victim and offender).

Results: Percent Binary Error for Validation Set
• Error metric: if x = x*, then error = 0; if x ≠ x*, then error = 1

[Figure: percentage of each output node inferred incorrectly vs. number of training cases]

Full CP Model
• 247 cases:
– 200 training cases (T)
– 47 validation cases (V)

• 57 total variables:
– 36 CS variables (inputs)
– 21 CP variables (outputs)

Internal Stability
• Internal stability refers to the consistency of predictions


Overall Performance Summary
• Overall Predictive Accuracy (OPA):

OPA(\%) = \frac{K_{C,CL(\geq 50\%)}}{K_t} \times 100

K_t: total predictions
K_C,CL(≥50%): total correct predictions with CL ≥ 50%

Algorithm                              K2      K2'
Accuracy (%)                           64.1%   79%
Correct predictions (number of nodes)  633     780
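The OPA computation is a filter-and-count over the validation predictions. A sketch with hypothetical triples, each carrying a predicted state, the observed state, and a confidence level:

```python
def opa(predictions):
    """Overall Predictive Accuracy: correct predictions with CL >= 50%,
    divided by the total number of predictions, times 100."""
    k_t = len(predictions)
    k_correct = sum(1 for pred, actual, cl in predictions
                    if pred == actual and cl >= 0.5)
    return 100.0 * k_correct / k_t

# Hypothetical (predicted, actual, confidence) triples for four output nodes:
preds = [("M", "M", 0.9), ("M", "F", 0.8), ("M", "M", 0.4), ("F", "F", 0.6)]
print(opa(preds))  # 50.0: two correct with CL >= 50% out of four predictions
```

Note that a correct prediction held with less than 50% confidence (the third triple) does not count toward the numerator, which is what distinguishes OPA from plain accuracy.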

Confidence Level of Predictions
• Confidence Level Accuracy (CLA):

CLA(\%) = \frac{K_{C,CL}}{K_{CL}} \times 100

Values shown as K2 || K2':

CL range          K_CL         K_C,CL       CLA (%)        Δ(K_C,CL)
0.5 ≤ CL < 0.7    225 || 262   140 || 162   62.2 || 61.8   22
0.7 ≤ CL < 0.9    405 || 470   334 || 386   82.5 || 82.1   52
0.9 ≤ CL          168 || 255   159 || 232   94.6 || 91     73

Zero Marginal Probability Variables
Algorithm                K2 || K2'
K_CL for CL ≥ 50%        798 || 987

PA(\%) = \frac{K_t - (K_w + K_{ZMP})}{K_t} \times 100

Remedies for ZMP variables:
1. Add more training cases
2. Declare variable independencies (K2')
3. Decrease the number of system variables

High Frequency Variables
3. Decrease the number of system variables:
– High-frequency variables are present in more than 50% of cases
– High Frequency Model (HFM): CP model with HF variables removed

CS behavior                          Frequency
Face not hidden                      88.4%
Victim found at scene where killed   78.9%
Victim found face up                 61.1%
Multiple wounds to the body          52.2%

HFM Overall Predictive Accuracy
• Negligible accuracy increase for K2'
• Decrease in the number of ZMP variables for K2

Algorithm    K2_HFM   K2'_HFM   K2      K2'
OPA (%)      66%      79.6%     64.1%   79%
K_C,≥50%     652      786       633     780
K_ZMP        168      0         189     0

Frequency of Occurrence
• Frequency of occurrence is the number of times a variable state was present in the dataset
• The frequency method (F) predicts the states of V by the more apparent state in T
• The CP produced by F is therefore the same over all of V
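The frequency baseline amounts to a modal-state predictor: for each output variable, count its states over T and always issue the most apparent one. A sketch with illustrative variable names and a toy training set:

```python
from collections import Counter

def frequency_profile(training_cases, output_vars):
    """For each output variable, predict the modal state in T together with
    its relative frequency; the same profile is issued for every new case."""
    profile = {}
    for var in output_vars:
        counts = Counter(case[var] for case in training_cases)
        state, n = counts.most_common(1)[0]
        profile[var] = (state, n / len(training_cases))
    return profile

# Illustrative training set for two binary profile variables:
T = [{"male": "P", "yoff": "A"}, {"male": "P", "yoff": "A"}, {"male": "A", "yoff": "A"}]
print(frequency_profile(T, ["male", "yoff"]))
```

Because the baseline ignores the evidence entirely, it produces one fixed profile with fixed confidences; the BN comparison on the following slides turns on exactly this difference.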

OPA for K2’ and F
• Overall Predictive Accuracy:

Algorithm                              K2'    F
Accuracy (%)                           79%    79.3%
Correct predictions (number of nodes)  780    784

Confidence Level of Predictions
• Confidence Level Accuracy (CLA), values shown as K2' || F:

CL range           K_CL         K_C,CL       CLA (%)        Δ(K_C,CL)
0.5 ≤ CL < 0.7     262 || 329   162 || 216   61.8 || 65.7   -54
0.7 ≤ CL < 0.9     470 || 470   386 || 396   82.1 || 84.3   -10
0.9 ≤ CL < 0.95    139 || 141   125 || 126   90 || 89.3     -1
0.95 ≤ CL          116 || 47    107 || 46    92.2 || 98     61

Information Entropy
• Information entropy (H) quantifies the certainty/uncertainty of a model
• The amount of information is related to the confidence of the prediction: less entropy means more confidence in the long run

H = -\sum_{i=1}^{r_i} p_i \log(p_i)
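The entropy of a single prediction's posterior follows the formula directly. Base-2 logarithms are used here so the unit is bits; the slides do not specify the base, so that choice is an assumption:

```python
import math

def entropy(p):
    """H = -sum_i p_i log2(p_i); zero-probability states contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))  # 1.0: a maximally uncertain binary prediction
print(entropy([1.0, 0.0]))  # a fully confident prediction has zero entropy
```

A posterior like [0.9, 0.1] therefore scores lower entropy than [0.6, 0.4], matching the slide's point that less entropy means more confidence in the long run.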

H for K2’ and F
• Computing H for the full K2' model is infeasible
• An average of each variable's individual H is a suitable measure, since the individual entropies bound the joint entropy:

H(X_1, \ldots, X_k) = \sum_{i=1}^{k} H(X_i \mid X_{i-1}, \ldots, X_1) \;\Rightarrow\; H(X_1, \ldots, X_k) \leq \sum_{i=1}^{k} H(X_i)

• H(K2') = 0.43 vs. H(F) = 0.48

CL Ranges for Predictions


Crime Scene Variables
• Input variables from Crime Scene (evidence)
Input variable   Definition
I1, pen          Foreign object penetration
I2, hid          Face hidden
I3, blnd         Blindfolded
I4, blnt         Wounds caused by blunt instrument
I5, suff         Suffocation (other than strangulation)

Criminal Profile Variables
• Output variables comprising the criminal profile
Output variable   Definition
O1, yoff          Young offender (17-21 years old)
O2, thft          Criminal record of theft
O3, frd           Criminal record of fraud
O4, brg           Criminal record of burglary
O5, rlt           Relationship with victim
O6, unem          Unemployed at time of offense
O7, male          Male
O8, famr          Familiar with area of offense occurrence

Predicted Case vs. Actual Case
Actual state of each variable (A = absent, P = present), with the K2' and frequency predictions and their confidence levels:

Variable           Actual   K2' profile   Frequency profile
Young              A        A (0.813)     A (0.805)
Theft              A        A (0.75)      A (0.54)
Fraud              P        A (0.76)      A (0.67)
Burglary           A        A (1)         A (0.67)
Relationship       A        A (1)         A (0.64)
Unemployed         P        P (0.79)      P (0.52)
Male               A        A (1)         P (0.9)
Familiar w/ area   P        P (0.91)      P (0.86)

Predicted Case vs. Actual Case
Actual state of each variable (A = absent, P = present), with the K2' and frequency predictions and their confidence levels:

Variable           Actual   K2' profile   Frequency profile
Young              A        A (0.83)      A (0.81)
Theft              P        P (0.52)      A (0.54)
Fraud              A        A (0.67)      A (0.67)
Burglary           P        A (0.54)      A (0.67)
Relationship       A        A (0.57)      A (0.64)
Unemployed         P        P (0.53)      P (0.52)
Male               P        P (0.82)      P (0.9)
Familiar w/ area   P        P (0.86)      P (0.86)

Predicted Case vs. Actual Case
Actual state of each variable (A = absent, P = present), with the K2' and frequency predictions and their confidence levels:

Variable           Actual   K2' profile   Frequency profile
Young              A        A (0.83)      A (0.81)
Theft              A        P (0.97)      A (0.54)
Fraud              P        A (0.67)      A (0.67)
Burglary           A        P (0.60)      A (0.67)
Relationship       P        A (0.59)      A (0.64)
Unemployed         A        P (0.61)      P (0.52)
Male               A        P (1)         P (0.9)
Familiar w/ area   P        P (0.87)      P (0.86)

Predicted Case vs. Actual Case
Actual state of each variable (A = absent, P = present), with the K2' and frequency predictions and their confidence levels:

Variable           Actual   K2' profile   Frequency profile
Young              A        A (0.93)      A (0.81)
Theft              A        A (0.75)      A (0.54)
Fraud              P        P (0.73)      A (0.67)
Burglary           A        A (0.82)      A (0.67)
Relationship       P        P (0.56)      A (0.64)
Unemployed         A        A (0.87)      P (0.52)
Male               P        P (0.89)      P (0.9)
Familiar w/ area   P        P (0.78)      P (0.86)

Slice of K2' DAG

[DAG figure: profile nodes yoff, thft, frd, brg, unmp, rlt, male, famr linked to evidence nodes pen, suff, blnd, hid, blnt]

Evidence: pen (penetration), blnd (blindfolded), hid (face hidden), blnt (blunt instrument), suff (suffocation)
CP: yoff (young offender), thft (theft), frd (fraud), brg (burglary), rlt (relationship w/ victim), unmp (unemployed), famr (familiar w/ area)

Conclusions
• Due to the absence of ZMP variables, the K2' structural learning algorithm requires fewer cases than K2
• A benefit of a BN model over the naive frequency approach to acquiring a CP is that the BN model's confidence levels span a range driven by the evidence
• Because all of the variables are binary, the frequency approach performs better than it would if the variables had many states

Further Research
• Develop a search algorithm that increases BN performance
• Incorporate Salfati's expressive/instrumental dichotomy to supervise training of a BN model
• Apply the method to other fields
• Combine NN and BN methods to improve model performance

Neural Network Research
• Neural networks implemented similarly to the BN

Algorithm                    K2'    NN
Nodes predicted correctly    780    739
Overall PA (%)               79%    75%

Acknowledgments
• Special thanks to
– Dr. Silvia Ferrari, advisor
– Dr. Gabrielle Salfati, John Jay College of Criminal Justice
– Dr. Marco Strano, President of the International Crime Analysis Association (ICAA)
– My Master's committee

April Fools' Day Origin
• April 1 was originally New Year's Day in France
• In 1582, Pope Gregory decreed that the Gregorian calendar was to replace the Julian calendar
• January 1 is New Year's Day according to the new calendar
• Those who refused to accept the new calendar were "April fools"

Questions?

10/18/2007

51

Laboratory for Intelligent Systems & Control Mechanical Engineering Duke University
