
University of California

Los Angeles

Multivariate Methods Analysis


of Crime Data in Los Angeles Communities

A thesis submitted in partial satisfaction


of the requirements for the degree
Master of Science in Statistics

by

Yan Fang

2011

© Copyright by
Yan Fang
2011
The thesis of Yan Fang is approved.

Vivian Lew

Nicolas Christou

Robert L. Gould

Jan de Leeuw, Committee Chair

University of California, Los Angeles


2011

To my beloved parents,
who always offer me unconditional love and support

Table of Contents

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Data Collection and Exploration . . . . . . . . . . . . . . . . . . . 2

2.1 Description of data . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.2 Normality of the data . . . . . . . . . . . . . . . . . . . . . . . . . 4

3 Principal Component Analysis on variables correlation . . . . . 8

3.1 Principal Component Analysis theory . . . . . . . . . . . . . . . . 8

3.2 PCA on communities profile variables . . . . . . . . . . . . . . . . 10

4 Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.1 Discriminant Analysis theory[7][8] . . . . . . . . . . . . . . . . . . 15

4.2 DA application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.2.1 DA on original variables . . . . . . . . . . . . . . . . . . . 18

4.2.2 DA on principal components . . . . . . . . . . . . . . . . . 19

4.2.3 Classify the communities with missing crime rate . . . . . 23

5 Conclusion and improvements for future study . . . . . . . . . . 26

5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.2 Improvements for future study . . . . . . . . . . . . . . . . . . . . 27

APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

A Useful Tables not Listed in Chapters . . . . . . . . . . . . . . . . 28

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

List of Figures

2.1 Normality check samples . . . . . . . . . . . . . . . . . . . . . . . 7

3.1 Scree plot of the standard deviations of the PCs . . . . . . . . . . 11

3.2 Plot of variance proportion and cumulative variance proportion


covered by each PC . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.3 First five eigenvectors for the PCA of the data . . . . . . . . . . . 13

4.1 Histogram and density plots for observations on the first linear
discriminant dimension when using five PCs . . . . . . . . . . . . 21

4.2 Histogram and density plots for observations on the first linear
discriminant dimension when using original variables . . . . . . . 21

4.3 Results of quadratic discriminant classification two variables at a


time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

List of Tables

2.1 Sample of the dataset . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Variables explanation . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.3 New variables which are after transformation . . . . . . . . . . . . 6

3.1 Variance coverage information of selected five PCs . . . . . . . . . 12

3.2 Table of loading matrix . . . . . . . . . . . . . . . . . . . . . . . . 14

4.1 Results of applying LDA on the original variables . . . . . . . . . 19

4.2 Results of applying QDA on the original variables . . . . . . . . . 19

4.3 Results of applying LDA on the five PCs . . . . . . . . . . . . . . 20

4.4 Results of applying QDA on the five PCs . . . . . . . . . . . . . . 22

4.5 Summarization of performance of all four conditions . . . . . . . . 22

4.6 Classification results for the missing data(partial) . . . . . . . . . 23

4.7 Coefficients of linear discriminator . . . . . . . . . . . . . . . . . 25

A.1 Variance coverage information of all 15 PCs . . . . . . . . . . . . 28

A.2 Classification results for the missing data . . . . . . . . . . . . . . 30

Acknowledgments

It is a pleasure to thank the many people who made this thesis possible.

Foremost, I would like to express my sincere gratitude to my advisors, Professor Jan de Leeuw, Professor Robert L. Gould, and Professor Vivian Lew, for their continuous support of my Master's study and research, and for their patience, motivation, enthusiasm, and immense knowledge. Their guidance helped me throughout the research and writing of this thesis.

Besides my advisors, I would like to thank Professor Nicolas Christou for serving on my committee and for his encouragement and insightful comments.

I thank my fellow classmates at UCLA for the sleepless nights we spent studying and working together, and for all the fun we have had in the last two years. Thanks to all those who supported and encouraged me during the completion of this thesis.

Last but not least, I would like to thank my family: my parents Laian Fang and Xuejuan Shen, for giving birth to me in the first place and supporting me spiritually throughout my life.

Abstract of the Thesis

Multivariate Methods Analysis


of Crime Data in Los Angeles Communities
by

Yan Fang
Master of Science in Statistics
University of California, Los Angeles, 2011
Professor Jan de Leeuw, Chair

The scope of crime prevention has grown considerably in the last few years. What was previously the sole concern of the police and the private security industry has spread to real estate developers, car manufacturers, residents' groups, and the planners of public facilities such as community offices and shopping centers. All of this calls for continuously improved ways of preventing crime, so discovering which variables are salient in affecting the crime rate becomes crucial. In this thesis, multivariate statistical analysis is applied to crime data for the city of L.A. to explore this topic. Specifically, we use Principal Component Analysis to discover the variables that are most influential in identifying a community as having a high or low crime rate, and we construct a baseline for classifying a community as safe or unsafe via Discriminant Analysis. These analytical techniques not only give police departments a sense of which communities are dangerous and deserve more enforcement, they also help governments identify which variables they need to change to make the city a better place to live.

CHAPTER 1

Introduction

Crime rate is often considered to be an extremely important index to judge the


welfare and the quality of living within a particular area. People weigh safety when they move, purchase real estate, or relocate for better career opportunities.

However, what factors may be related to safety? Are there variables one can observe to predict the safety and security of communities even without the actual crime rate statistics? We want a baseline for classifying a particular community because 47 of the 272 communities in L.A. are missing crime counts according to the records from the Los Angeles Times [1].

We try to answer the above questions with the help of multivariate statistical techniques applied to the variables used to profile L.A. communities. We use Principal Component Analysis for pattern recognition, determining which variables most strongly affect a community's safety. In addition, Discriminant Analysis is applied to create a rule for classifying a community as safe or unsafe. The potential results are valuable for police departments, which can increase enforcement and step up patrols in unsafe communities; for residents of communities with high crime rates, who must always be on the lookout for criminal offenders; and for city councils, which may reduce the crime rate by adjusting certain variables.

CHAPTER 2

Data Collection and Exploration

As is well known, the causes and origins of crime are varied and complicated, and they differ considerably from area to area. When considering factors which may have an impact on the crime rate, everyone can think of several variables. Fortunately, the variables affecting crime have historically been the subject of investigation in many disciplines. Therefore, we can summarize the known factors affecting the volume and type of crime occurring from place to place as follows [2]:

1. Population density and level of urbanization

2. Composition of the population, particularly the percentage of youth

3. Modes of transportation system

4. Economic condition

5. Cultural and Educational factors

6. Climate factors

7. Effective strength of law enforcement agencies

8. Citizens’ attitudes toward crime

2.1 Description of data

We ultimately analyze data based on 15 variables that fall under several general categories of characteristics: population, economic, social, housing, and climate characteristics.
It is worth mentioning that we do not include all of the common factors listed above, because factors like culture and transportation are largely the same across our study area, the city of L.A., and, compared to the country and the state, some statistics are not measurable or applicable at the community level. We collect data from the Los Angeles Times Local Search [1], Yahoo Real Estate [3], and Wikipedia, all from the year 2008, because this is the only recent year for which we could find complete data. The dataset has 132 entries in total, each representing a particular community in L.A. city. Table 2.1 shows the first few entries of the dataset (a missing CRIME value means the community has no crime record); we note that some values of TEMP1 are unusual, and the reader can check Yahoo Real Estate for verification. Table 2.2 lists all the variables and what they stand for.
  Community            CRIME    POP    DENS   ETHN  AVRAGE   EDU     AHI  AHS      SP
1 Acton                 46.0   6522     166   81.7      37  16.6   83983  3.0   144.0
2 Agoura Hills          50.2  20324    2495   82.8      37  48.4  117608  3.0   551.0
3 Alhambra                NA  85961   11275   13.8      35  27.5   53224  2.8  2646.0
4 Altadena              71.5  42680    4900   39.6      37  39.2   82676  2.8  1347.0
5 Arcadia                 NA  52951    4749   40.3      40  44.4   75808  2.7  1392.0
6 Arlington Heights    105.0  23330   21423    4.7      31  13.9   31421  3.0  1165.0
...                      ...    ...     ...    ...     ...   ...     ...  ...     ...

  Community            VETE    FB   TEMP1  TEMP7   SD  UNEMP  CLI
1 Acton                15.6   7.1      32     98  285    7.1  138
2 Agoura Hills          9.9  13.5      30     98  286    3.1  187
3 Alhambra              5.0  50.8      42     89  281    6.1  125
4 Altadena              9.8  18.7      43     89  287    7.1  130
5 Arcadia               8.0  43.6      42     89  287    4.0  162
6 Arlington Heights     5.3  49.8      11     83  185    5.3  118
...                     ...   ...     ...    ...  ...    ...  ...

Table 2.1: Sample of the dataset

Variable  Description
POP population
DENS number of people per square mile
ETHN percentage of white people
AVRAGE median age of the residents
EDU percentage of residents having bachelor or higher degree
AHI median household income
AHS average household size
SP number of family headed by single parent
VETE percentage of veterans
FB percentage of foreign-born residents
TEMP1 January average temperature
TEMP7 July average temperature
SD annual sunny days
UNEMP unemployment rate
CLI cost of living index

Table 2.2: Variables explanation

2.2 Normality of the data

Often, before doing any statistical modeling, it is crucial to verify whether the
data at hand satisfy the underlying distribution assumptions. Multivariate nor-
mal distribution is one of the most frequently made distributional assumptions
when using multivariate statistical techniques, e.g. Principal Component Analy-
sis and Discriminant Analysis. Also, an important property of the multivariate normal distribution is that if X = (X1, X2, ..., Xp) follows a multivariate normal distribution, then its individual components X1, X2, ..., Xp are all normally distributed. Examining the normality of each Xi is therefore a necessary (though not by itself sufficient) check on whether X = (X1, X2, ..., Xp) can plausibly be treated as multivariate normal.

Here, we use quantile-quantile plot (QQ plot) to assess normality of data


though more formal mathematical tests exist. The reason is that with a large dataset, a formal test can detect even mild deviations from normality which we may be willing to accept given the sample size. A graphical method is also easier to interpret and makes it easy to identify outliers. In a QQ plot, we compare the standardized sample values of a variable against the quantiles of the standard normal distribution. The correlation between the sample quantiles and the normal quantiles measures how well the data are modeled by a normal distribution. For normal data, the plotted points should fall approximately on a straight line. If they do not, transformations such as the logarithm, square root, or a power transformation can be applied to bring the data closer to normality.

Drawing a QQ plot of each variable reveals that 6 of the 15 variables approximately follow a normal distribution: ETHN, AVRAGE, AHS, VETE, FB, and TEMP7. We try different transformations on the remaining 9 variables to obtain substitutes that perform better with respect to normality. Table 2.3 lists the specific forms used to construct the new variables. Figure 2.1 shows the QQ plots of POP and AHI both before and after transformation. The plots illustrate the effectiveness of the transformations, since the data appear much closer to normal afterwards.

New Variable
sqrt(POP)
sqrt(DENS)
ETHN
AVRAGE
sqrt(EDU)
log(AHI)
AHS
sqrt(SP)
VETE
FB
squared(squared(TEMP1))
TEMP7
sqrt(log(SD))
log(log(UNEMP))
log(CLI)

Table 2.3: New variables which are after transformation


Figure 2.1: Normality check samples
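
As a rough sketch (not the author's original script), the QQ-plot check and the transformations of Table 2.3 could be reproduced in R along the following lines; the data frame name `la`, the file name, and the use of the column names from Table 2.2 are assumptions.

    # Read the community profile data (hypothetical file name)
    la <- read.csv("la_communities.csv")

    # QQ plots before and after transformation, as in Figure 2.1
    par(mfrow = c(2, 2))
    qqnorm(la$POP,       main = "POP");       qqline(la$POP)
    qqnorm(sqrt(la$POP), main = "sqrt(POP)"); qqline(sqrt(la$POP))
    qqnorm(la$AHI,       main = "AHI");       qqline(la$AHI)
    qqnorm(log(la$AHI),  main = "log(AHI)");  qqline(log(la$AHI))

    # Transformed variables of Table 2.3, reusing the original names
    # (log(log(UNEMP)) assumes UNEMP > 1, which holds for these data)
    la.t <- with(la, data.frame(
      POP  = sqrt(POP),  DENS = sqrt(DENS), ETHN, AVRAGE,
      EDU  = sqrt(EDU),  AHI  = log(AHI),   AHS,  SP = sqrt(SP),
      VETE, FB, TEMP1 = (TEMP1^2)^2, TEMP7,
      SD = sqrt(log(SD)), UNEMP = log(log(UNEMP)), CLI = log(CLI)))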

CHAPTER 3

Principal Component Analysis on variables


correlation

In general, having a large number of variables in a study makes analysis difficult because variables sometimes capture the same information, i.e., there is multicollinearity among the variables. The redundancy in those variables makes the pattern of association indistinct. Therefore, in this chapter we introduce Principal Component Analysis (PCA), a method that rotates the variable matrix X = (X1, X2, ..., Xp) to achieve orthogonality, making patterns easier to decipher and at the same time reducing the dimension of the data for simpler processing. With the assistance of PCA, we hope to obtain a clear pattern in our communities profile data and to see how a smaller set of variables can capture most of the information in the whole dataset.

3.1 Principal Component Analysis theory

PCA is a mathematical procedure that uses an orthogonal transformation to


convert a set of observations of possibly correlated variables into a set of values
of uncorrelated variables called principal components (PCs). PCA is appropriate and very useful when you have measured data on a number of variables (possibly a large number) and believe that some of the variables are correlated with one another, possibly because they are measuring a similar underlying structure. Because of this redundancy, it should be possible to reduce the observed variables to a smaller number of PCs which account for most of the variance in the observed variables.

Simply stated, the main objective of PCA is to find a p × p rotation matrix U = (U1, U2, ..., Up) such that the resulting linear combinations of the original variables are orthogonal. These combinations are the PCs, and they are ordered by the amount of variance they account for: the first PC has the highest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component has the highest variance subject to being orthogonal to the preceding components. Specifically, if Z = XU, we want Z^T Z = diag(λ1, λ2, ..., λp), where λ1 ≥ λ2 ≥ ... ≥ λp ≥ 0. Here, we denote Σ = X^T X as the covariance matrix of X. Since

Z^T Z = U^T X^T X U = U^T \Sigma U     (3.1)

we can easily verify that λ1, λ2, ..., λp are the eigenvalues of Σ and that λi is the variance of Zi. The eigenvectors of X^T X are the columns of U [4][5]. The percentage of variance that a particular PC accounts for is defined as

\frac{\text{variance of a particular PC}}{\text{sum of all the variances}} = \frac{\lambda_i}{\sum_{i=1}^{p} \lambda_i}     (3.2)

All of the above indicates that PCA allows us to recover most of the information in the entire dataset from fewer PCs than original variables, through the relation between the two sets captured in the loading matrix U. However, there are still some points we should pay attention to:

(1) Meaning of the loading matrix U. The loading matrix contains the weights of the original variables in each PC. It shows which PC each original variable is most highly associated with. In addition, the original variables most highly correlated with a particular PC serve to determine the label of that PC.

(2) Number of PCs extracted. PCA is used to reduce a set of observed variables to a new set of variables with lower dimensionality. The choice of this dimensionality, that is, the number k of PCs, is crucial for the interpretation of the results and for subsequent analysis, because it can lead to a loss of information (underestimation) or to the inclusion of random noise (overestimation). The standard criterion is to choose the number beyond which all the eigenvalues are relatively small; a good way to judge this is a scree plot, which is an index plot of the eigenvalues. Secondly, to adequately cover the information in the dataset, we want the cumulative variance that the k PCs account for to reach a certain level, i.e.

\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \geq \text{threshold (e.g. 0.8 or 0.9)}

Next, we use PCA on the data described above to discover the internal structure of the L.A. communities profile variables and to obtain labeled PCs suitable for the subsequent classification analysis.

3.2 PCA on communities profile variables

We now run PCA on all 15 profile variables. It should be noted that PCA is sensitive to the scaling of the variables. In our case, the variables are measured on different scales, so we use the correlation matrix instead of the covariance matrix, which is equivalent to giving every variable unit variance.

The first and very important step of PCA is to determine the number of PCs. Here, we use the criteria described in Section 3.1 to make the selection. As shown in Figure 3.1, the "elbow" appears at the 4th PC, which suggests keeping the first three PCs. However, we still need to check Figure 3.2, in which the black line represents the proportion of variance covered by each PC and the red line is the cumulative variance proportion for different numbers of PCs. We find that three PCs account for only about 70% of the total variance, which we consider insufficient information coverage. Instead, we prefer to use five PCs, which cover almost 85% of the cumulative variance. The standard deviation of the 5th PC is not that small compared to the remaining PCs, which suggests that five PCs is still a worthwhile reduction in dimensionality.

Figure 3.1: Scree plot of the standard deviations of the PCs


Figure 3.2: Plot of variance proportion and cumulative variance proportion cov-
ered by each PC

Table 3.1 gives the standard deviation, proportion of variance, and cumulative percentage of the total variance for the first five PCs. Together they account for 84.8% of the total variance of the original variables while reducing the data dimension from 15 to 5. The same values for all 15 PCs can be found in Table A.1 in the Appendix.

SD % of Variance Cumulative % of Variance


PC1 2.601 0.451 0.451
PC2 1.452 0.140 0.591
PC3 1.263 0.106 0.698
PC4 1.185 0.094 0.791
PC5 0.923 0.057 0.848

Table 3.1: Variance coverage information of selected five PCs

In Figure 3.3, we visualize the coefficients of the variables in each PC. Combined with the exact values of these coefficients in Table 3.2 (the loading matrix of the PCA), we can clearly see which variables dominate each PC. For example, in PC2, squared(squared(TEMP1)) and TEMP7 have relatively large absolute values, hence we label PC2 as a climate factor. Similarly, we view PC3 as a population factor, PC4 as a factor related to annual sunny days, and PC5 as an employment condition factor. However, it is difficult to label PC1, in which ETHN, AVRAGE, sqrt(EDU) and log(AHI) all have absolute values around the 0.34 level; this illustrates that the sometimes difficult interpretation of PCs is one major disadvantage of PCA.


Figure 3.3: First five eigenvectors for the PCA of the data

Index Variable PC1 PC2 PC3 PC4 PC5
1 sqrt(POP) 0.175 0.356 -0.520 0.194 -0.039
2 sqrt(DENS) 0.269 0.266 0.196 0.265 0.014
3 ETHN -0.349 0.022 -0.054 0.142 -0.095
4 AVRAGE -0.339 0.043 -0.064 0.098 0.117
5 sqrt(EDU) -0.336 0.115 0.028 0.227 0.068
6 log(AHI) -0.339 -0.028 -0.064 -0.156 0.214
7 AHS 0.286 -0.169 -0.000 -0.344 0.291
8 SP 0.214 0.319 -0.494 0.120 -0.182
9 VETE -0.293 -0.172 -0.294 -0.106 -0.253
10 FB 0.300 0.145 0.164 0.064 0.405
11 squared(squared(TEMP1)) -0.061 0.541 0.197 -0.346 0.030
12 TEMP7 0.125 -0.414 -0.367 -0.030 0.388
13 sqrt(log(SD)) -0.110 0.239 -0.337 -0.545 0.231
14 log(log(UNEMP)) 0.148 -0.039 0.117 -0.467 -0.548
15 log(CLI) -0.277 0.288 0.145 -0.053 0.271

Table 3.2: Table of loading matrix
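
The loading matrix of Table 3.2 and the PC scores used in the next chapter can be read off the same fitted object; a brief sketch, continuing the assumed `pca` object from above:

    round(pca$rotation[, 1:5], 3)   # loadings of the first five PCs (Table 3.2)
    scores <- pca$x[, 1:5]          # 132 x 5 matrix of PC scores for later use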

CHAPTER 4

Discriminant Analysis

As seen in the previous chapter, PCA extracts features, i.e. PCs, that describe the pattern of the communities' profile variables well. However, are they good enough for distinguishing between classes and finding separating patterns? We need such features because our next goal is to find discriminative features that can serve as a baseline for classifying observations into predefined classes; in our case, for classifying a particular community as safe or unsafe. The features extracted by PCA are "global" features for the whole dataset, and thus not necessarily the most useful ones for discriminating one class from another. Therefore, in this chapter we discuss Discriminant Analysis (DA), a method designed for discriminating between different pattern classes, which seeks a transformation of the variables that maximizes the between-class variance while minimizing the within-class variance.

4.1 Discriminant Analysis theory[7][8]

Discriminant Analysis is a method used in statistics and pattern recognition to


find a combination of features which characterize or separate two or more classes
of objects or events. The resulting combination may be used as a classifier to
assign objects to previously defined classes. For this case, we only consider the
situation of two classes.

The classification process is based on the values of the profile variables X = (X1, X2, ..., Xp). We take the majority of the observations, whose correct classes are known, as the training sample; the remaining observations form the testing sample, which is used to validate the classification rule derived from the training sample.

The measurements of all objects in class k are characterized by a probability density function fk(X), which is seldom known. There may also be some prior knowledge about the probability of observing a member of class k, the prior probability πk, with π1 + π2 + ... + πK = 1. A training sample is used to estimate fk(X) and πk. The most frequently applied classification rules are based on the multivariate normal distribution

f_k(X) = f(X \mid k) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \, e^{-\frac{1}{2}(X-\mu_k)^T \Sigma_k^{-1} (X-\mu_k)}     (4.1)
where µk and Σk are the class k population mean vector and covariance matrix.
Under this assumption, the probability that an object with given vector X = (X1, X2, ..., Xp) belongs to class k can be calculated by the formula below:

p(k \mid X) = \frac{f(X \mid k)\, \pi(k)}{\sum_{k=1}^{K} \pi(k)\, f(X \mid k)} \propto f(X \mid k)\, \pi(k)     (4.2)

Taking the logarithm of Equation 4.2 will lead to the discriminant function

d_k(X) = (X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k) + \log|\Sigma_k| - 2\log\pi_k     (4.3)

and the classification rule

d_{\hat{k}}(X) = \min_k d_k(X) \iff \max_k p(k \mid X)     (4.4)

The rule described in Equation 4.3 and Equation 4.4 is called Quadratic Discriminant Analysis (QDA). In the special case where all class covariance matrices are identical, Σk = Σ, the discriminant function simplifies to Equation 4.5, which defines Linear Discriminant Analysis (LDA):

d_k(X) = 2\mu_k^T \Sigma^{-1} X - \mu_k^T \Sigma^{-1} \mu_k - 2\log\pi(k)     (4.5)

Therefore, the first step in performing DA is to check whether the covariance matrices of the two classes in the training sample are equal, so that the appropriate method, QDA or LDA, can be applied. A Chi-square test can be used to check the equality of the two covariance matrices; we do not elaborate on the details here.
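
As an illustration only, the quadratic score of Equation 4.3 for a single class can be written directly in R; `x` is assumed to hold the training observations of that class and `prior` its prior probability.

    # Quadratic discriminant score d_k(X) of Equation 4.3 for one class
    quad.score <- function(xnew, x, prior) {
      mu  <- colMeans(x)    # estimated class mean vector
      Sig <- cov(x)         # estimated class covariance matrix
      d   <- as.numeric(t(xnew - mu) %*% solve(Sig) %*% (xnew - mu))
      d + log(det(Sig)) - 2 * log(prior)
    }
    # Equation 4.4: assign xnew to the class with the smallest score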

4.2 DA application

The original data, it is a data set has 15 variables for 132 communities of L.A. city
(the number is less than the actual number because only these 132 community
have all variables’ value we need). Here, we will introduce one more response
variable named CRIME which is a record of number of crimes per 10000 people
for the recent six months(from Oct. 18, 2010 to April 17, 2011). We can see
Table 2.1 for reference. There are only 85 out of 132 communities having records
on the crime rate, others lack records due to late updates or some other reasons.
So our objective is to name the remain 47 communities are safe or not by using
DA on the known information. One thing may be noticed is that the period is
not consistent with other 15 variables’ which are from 2008, but we just assume
it is acceptable because our focus is on the DA method and also 2010-2011 crime
rate is the only resource we can find for L.A. city communities.

We then classify each community in our data as high or low in crime rate relative to the city average of 107.6 crimes per 10,000 people, assigning "D" to communities above the city average and "S" to those below it. This gives 38 "D" and 47 "S" among the 85 communities.

Since our valid data contain only 85 entries, splitting them into a training sample and a testing sample would leave too little data to obtain an accurate classification rule. Instead, we use the whole 85 by 15 data matrix to form the classification baseline and test its effectiveness on the same 85 observations. Before performing DA, we first need to check whether the covariance matrices ΣD and ΣS are equal. The result of the Chi-square test convinces us that they are significantly unequal, which implies that QDA should be used on our data.

4.2.1 DA on original variables

We still use the transformed original variables here, which approximately satisfy the multivariate normality assumption underlying the DA method.

Although the covariance matrices of the two classes are not equal, we run LDA anyway in order to gauge its performance relative to QDA. Using the classification rule generated from the 85 observations to re-classify the same observations gives the results below. Table 4.1 and Table 4.2 summarize the number of misclassifications and the apparent error rates (in parentheses). For instance, LDA correctly classifies 28 of the 38 unsafe communities; the 10 misclassifications yield an apparent error rate of 26.3%. It misclassifies 8 of the 47 safe communities, an error rate of 17%, so the overall correct rate is 78.8%. The results of the QDA method in Table 4.2 are striking by comparison: only 10.5% of the unsafe communities and 6.4% of the safe communities are misclassified, giving a 91.8% correct rate. We conclude that QDA performs better than LDA here, and the quite high correct rate makes us believe QDA can be an effective procedure for classifying unknown observations in our case.

          Predicted D   Predicted S
Actual D  28 (0.737)    10 (0.263)
Actual S  8 (0.170)     39 (0.830)

Table 4.1: Results of applying LDA on the original variables

          Predicted D   Predicted S
Actual D  34 (0.895)    4 (0.105)
Actual S  3 (0.064)     44 (0.936)

Table 4.2: Results of applying QDA on the original variables
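
A sketch of how such confusion matrices could be produced with the MASS package, assuming `train` is a data frame holding the 15 transformed profile variables plus a factor column `class` ("D"/"S") for the 85 communities with a known crime rate:

    library(MASS)

    # class was defined as "D" if CRIME exceeds the city average of 107.6, else "S"
    fit.lda <- lda(class ~ ., data = train)
    fit.qda <- qda(class ~ ., data = train)

    # re-classify the training data and tabulate the apparent errors
    table(actual = train$class, predicted = predict(fit.lda)$class)  # cf. Table 4.1
    table(actual = train$class, predicted = predict(fit.qda)$class)  # cf. Table 4.2
    mean(predict(fit.qda)$class == train$class)                      # apparent correct rate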

4.2.2 DA on principal components

Since the PCs capture the majority of the information in the original data while being far fewer in number than the original variables, it is natural to use the factors computed from PCA as the input features for the DA algorithm. These new features, which are linear combinations of the original variables, have several advantageous properties: (1) their interpretation allows us to detect patterns in the initial data space; (2) the number of factors is greatly reduced, and by using only the most relevant factors we can also remove noise from the dataset; (3) algorithms such as LDA can behave better because the PCs form an orthogonal basis.

DA applied to the PCs proceeds as described above; the difference is that the data matrix is now 85 by 5, containing the scores of each observation on our five selected PCs.
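
A sketch of the same step on the PC scores, assuming `scores` is the matrix of PC scores from Chapter 3 and `has.crime` is a logical indicator (an assumed name) selecting the 85 labeled communities:

    pcs <- data.frame(scores[has.crime, ], class = train$class)

    fit.lda.pc <- lda(class ~ ., data = pcs)
    fit.qda.pc <- qda(class ~ ., data = pcs)

    plot(fit.lda.pc)      # per-class histograms on LD1, as in Figure 4.1
    fit.lda.pc$scaling    # coefficients of the linear discriminant, cf. Table 4.7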

Let's look at the results from LDA. Table 4.3 shows that 9 of the 38 unsafe communities are misclassified, an apparent error rate of 23.7%, and that 8 of the 47 safe communities are misclassified into the opposite class, an error rate of 17%. The rate of correct classification is consequently 80%, slightly higher than the 78.8% obtained by LDA on the original variables. Figure 4.1 shows the histogram and density plots for the observations on the first linear discriminant dimension (LD1) when using the five PCs. The horizontal axis gives the scores of the observations on LD1, so the more clearly the two histograms separate from each other, the more between-class difference LD1 captures. Figure 4.2 is the analogous plot when the original variables are used. Whether we compare Table 4.1 with Table 4.3 or Figure 4.2 with Figure 4.1, we find that LDA classifies better on the PCs than on the original variables. What about the QDA method?

          Predicted D   Predicted S
Actual D  29 (0.763)    9 (0.237)
Actual S  8 (0.170)     39 (0.830)

Table 4.3: Results of applying LDA on the five PCs


Figure 4.1: Histogram and density plots for observations on the first linear dis-
criminant dimension when using five PCs

Figure 4.2: Histogram and density plots for observations on the first linear dis-
criminant dimension when using original variables

Again, we review the misclassification counts and proportions, this time for QDA on the PCs, shown in Table 4.4. The rate of correct classification is 76.5%.

          Predicted D   Predicted S
Actual D  27 (0.711)    11 (0.289)
Actual S  9 (0.191)     38 (0.809)

Table 4.4: Results of applying QDA on the five PCs

To get a better overall sense of the four settings considered above, Table 4.5 summarizes the rate of type I error (misclassifying "D" as "S"), the rate of type II error (misclassifying "S" as "D"), and the rate of correct classification.

Type I error Type II error Correct classification


LDA on original variables 0.263 0.17 0.788
QDA on original variables 0.105 0.064 0.918
LDA on PCs 0.237 0.17 0.8
QDA on PCs 0.289 0.191 0.765

Table 4.5: Summarization of performance of all four conditions

According to the last column of Table 4.5, the performance of the four classification models is ordered as follows:

QDA on original variables ≥ LDA on PCs ≥ LDA on original variables ≥ QDA on PCs     (4.6)

QDA applied to the original variables has a clear advantage in classification accuracy, and it will therefore be used to classify the communities whose crime attribute is missing.

4.2.3 Classify the communities with missing crime rate

Now we classify the 47 communities with no crime record into two classes: safe or unsafe. Running the quadratic discriminant classification on the transformed original variables and calculating the posterior probability of each community belonging to each class gives results like those in Table 4.6; only 5 of the 47 are shown here, and the full classification results are in Table A.2. This method assigns 18 of the 47 communities to the relatively dangerous class and the remaining 29 to the relatively safe class, and based on our previous analysis we expect this classification to have a high correct rate.

Index Community Class Posterior.D Posterior.S


3 Alhambra S 0.000 1.000
5 Arcadia S 0.000 1.000
10 Azusa D 0.641 0.359
11 Baldwin Park D 0.811 0.189
12 Bell D 0.539 0.461
... ... ... ... ...

Table 4.6: Classification results for the missing data(partial)
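
A brief sketch of this prediction step, assuming `unlabeled` (a hypothetical name) is the 47 by 15 data frame of transformed profile variables for the communities without a crime record and `fit.qda` is the QDA fit from Section 4.2.1:

    pred <- predict(fit.qda, newdata = unlabeled)
    table(pred$class)                                                # 18 "D" and 29 "S"
    head(data.frame(Class = pred$class, round(pred$posterior, 3)))   # cf. Table 4.6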

Figure 4.3 is a visual display of the quadratic discriminant classification results taken two variables at a time. Black identifiers mark correctly classified observations and red identifiers mark misclassified ones. Since, unlike LDA, QDA does not have a single explicit linear discriminator, we can only get a basic sense of the relation between each variable and the discriminator from a partition plot like Figure 4.3, whose discrimination boundaries are hard to describe. Therefore, although QDA has better classification accuracy, we still resort to LDA to quantify the contribution of each variable to the discriminator. An example is shown below.

[Figure 4.3 shows pairwise partition plots for the variables POP, DENS, SP, UNEMP and CLI, with per-panel apparent error rates ranging from 0.234 to 0.404.]

Figure 4.3: Results of quadratic discriminant classification two variables at a time
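
A partition plot of this kind can be drawn, for example, with the klaR package; this is only a sketch under the assumption that `train` contains the transformed variables and the class labels used above.

    library(klaR)
    # pairwise QDA classification regions for five of the variables
    partimat(class ~ POP + DENS + SP + UNEMP + CLI, data = train, method = "qda")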

LDA on the PCs is chosen for the linear classification because this model performs second best, behind only QDA on the original variables. The model assigns 28 of the 47 communities to the unsafe class and 19 to the safe class, which differs considerably from the 18 "D" and 29 "S" obtained before. However, the linear method offers a better interpretation of how the input features affect the discriminator. Table 4.7 shows the coefficients of the PCs in the linear combination forming the discriminator. PC2, which represents the climate factor, makes the largest contribution in absolute value to the linear discriminant function, followed by PC5, which represents the employment condition.

LD1

PC1 -0.21
PC2 -0.68
PC3 0.39
PC4 -0.38
PC5 0.50

Table 4.7: Coefficients of linear discriminator

CHAPTER 5

Conclusion and improvements for future study

This thesis studies the roles of Principal Component Analysis and Discriminant Analysis in the pattern recognition and classification of L.A. communities profile data collected for a crime rate study. The two techniques prove effective in uncovering the internal structure of data of high dimensionality, but problems remain, and further improvements could increase the significance and accuracy of this kind of pattern recognition process.

5.1 Conclusion

(1) Principal Component Analysis is successfully applied to our data, extracting five PCs from the 15 original variables, a substantial dimensionality reduction. These PCs account for almost 85% of the variance of the original dataset, so not much information is lost.

(2) The original variables most highly correlated with a particular PC serve to determine the label of that PC. Ethnicity, median age of the residents, percentage of residents holding a bachelor degree, and median household income jointly contribute to PC1. PC2, PC3, PC4 and PC5 represent a climate factor, a population factor, a factor related to annual sunny days, and an employment condition factor, respectively. In other words, these eight variables can considerably help when classifying a community as safe or unsafe.

(3) To make up for the fact that the features PCA extracts are "global" features, not necessarily representative for discriminating one class from another, Discriminant Analysis, a method better suited to separating different pattern classes, is applied to classify the communities that have no record of crime counts. Linear DA and Quadratic DA are applied both to the original variables and to the PC space. For our case, the quadratic discriminant model on the original variables performs best.

(4) According to the resulting discriminant rule, we classify the 47 unrecorded observations: 18 communities are labeled unsafe and 29 safe.

5.2 Improvements for future study

(1) The variables we chose are based on historical studies of possible factors causing crime, and also on the accessibility of the data. If data resources were sufficient, we could analyze additional variables, especially categorical variables such as major traffic mode or alcohol consumption, because discriminant analysis has advantages when the dataset involves categorical variables, something we could not verify in our research.

(2) Discriminant analysis can only indicate the likely class of an observation; it does not quantify specific properties of an observation, for instance with a confidence interval. Hence, we may further use regression methods to find the significant variables for prediction work.

APPENDIX A

Useful Tables not Listed in Chapters

SD % of Variance Cumulative % of Variance


PC1 2.601 0.451 0.451
PC2 1.452 0.140 0.591
PC3 1.263 0.106 0.698
PC4 1.185 0.094 0.791
PC5 0.923 0.057 0.848
PC6 0.781 0.041 0.889
PC7 0.666 0.030 0.918
PC8 0.602 0.024 0.943
PC9 0.514 0.018 0.960
PC10 0.478 0.015 0.975
PC11 0.445 0.013 0.989
PC12 0.291 0.006 0.994
PC13 0.208 0.003 0.997
PC14 0.148 0.001 0.999
PC15 0.141 0.001 1.000

Table A.1: Variance coverage information of all 15 PCs

Index Community Class Posterior.D Posterior.S
3 Alhambra S 0.000 1.000
5 Arcadia S 0.000 1.000
10 Azusa D 0.641 0.359
11 Baldwin Park D 0.811 0.189
12 Bell D 0.539 0.461
14 Bell Gardens S 0.486 0.514
15 Beverly Hills D 1.000 0.000
18 Burbank D 0.851 0.149
25 Claremont S 0.001 0.999
27 Covina S 0.001 0.999
28 Cudahy S 0.065 0.935
29 Culver City S 0.001 0.999
32 Downey D 1.000 0.000
34 El Monte D 0.944 0.056
35 El Segundo S 0.014 0.986
38 Gardena S 0.364 0.636
39 Glendale S 0.001 0.999
40 Glendora S 0.000 1.000
45 Hawthorne D 1.000 0.000
46 Hermosa Beach S 0.000 1.000
48 Huntington Park D 0.686 0.314
49 Inglewood D 1.000 0.000
50 Irwindale S 0.000 1.000
58 La Verne S 0.000 1.000
63 Long Beach D 1.000 0.000
66 Manhattan Beach S 0.000 1.000

68 Maywood S 0.013 0.987
70 Monrovia D 0.736 0.264
71 Montebello D 0.653 0.347
72 Monterey Park S 0.000 1.000
80 Palos Verdes Estates S 0.000 1.000
83 Pasadena S 0.020 0.980
86 Pomona D 0.722 0.278
90 Redondo Beach S 0.001 0.999
97 San Fernando S 0.002 0.998
98 San Gabriel S 0.000 1.000
99 San Marino S 0.000 1.000
102 Santa Fe Springs S 0.001 0.999
103 Santa Monica D 0.911 0.089
105 Sierra Madre S 0.107 0.893
106 Signal Hill D 0.990 0.010
107 South Gate S 0.049 0.951
112 Sun Valley S 0.080 0.920
117 Torrance S 0.002 0.998
123 Vernon S 0.015 0.985
124 West Covina D 0.981 0.019
129 Whittier D 0.905 0.095

Table A.2: Classification results for the missing data

References

[1] "Neighborhoods", Los Angeles Times, available at http://projects.latimes.com/mapping-la/neighborhoods/neighborhood/list/page/1/

[2] U.S. Department of Justice, "Variables Affecting Crime", Crime in the United States, 2009, 2010.

[3] Yahoo! Real Estate, Yahoo! Inc., available at http://realestate.yahoo.com/neighborhoods

[4] Julian J. Faraway, Linear Models with R, Chapman & Hall/CRC, Boca Raton, Florida, 2005, pp. 133-140.

[5] "Principal Component Analysis", Wikipedia Encyclopedia, 2011, available at http://en.wikipedia.org/wiki/Principal_component_analysis

[6] Wu, Ying, Principal Component Analysis and Linear Discriminant Analysis, available at http://users.eecs.northwestern.edu/~yingwu/teaching/EECS510/Notes/PCA_LDA_handout.pdf

[7] Birkel, Daniela, Regularized Discriminant Analysis, available at http://www.uni-leipzig.de/~strimmer/lab/courses/ss06/seminar/slides/daniela-2x4.pdf

[8] Khattree, Ravindra and Dayanand N. Naik, Applied Multivariate Statistics With SAS Software, Second Edition, Cary, NC, SAS Institute Inc., 1999, pp. 1-17.
