All content following this page was uploaded by Habibullah Magsi on 27 July 2017.
vsutahar@yahoo.co.uk
2 Department of Agricultural Economics, Sindh Agriculture University, Tandojam, Pakistan; hmagsi@sau.edu.pk
3 Department of Rural Sociology, Sindh Agriculture University, Tandojam, Pakistan; ruralsociologyst@gmail.com
4 Information Technology Centre, Sindh Agriculture University, Tandojam, Pakistan; mubeena2009@gmail.com, qureshi5939@gmail.com
Abstract
Objectives: To apply Principal Component Analysis (PCA) to medical data in order to explore the factors considered most important in increasing the risk of Ischemic Heart Disease (IHD). Methods/Statistical Analysis: PCA was performed in R-mode using correlation and covariance for the medical data. Variables from chemical tests of blood, namely cholesterol, high density lipoprotein, triglyceride, Apo protein A-1, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid, were examined to determine the relationship between them and the group membership variable. Findings: Cholesterol, triglyceride, Apo protein B, low density lipoprotein, phospholipids, total lipid and uric acid were higher in the IHD group than in the control group, whereas high density lipoprotein and Apo protein A-1 were lower in the IHD group than in the control group. Cholesterol was highly correlated with low density lipoprotein and moderately correlated with total lipid. Cholesterol, Apo protein B and low density lipoprotein belonged to component 1; Apo protein A-1, phospholipids and uric acid to component 2; triglycerides and total lipid to component 3; and high density lipoprotein and glucose to component 4. The first four components explained 60.67 percent of the total variability. Improvements/Applications: The average cholesterol level, which is considered the main risk factor for Ischemic Heart Disease, was found to be higher than normal even in the control group.
of observed variables, so there is considerable parsimony in using factor analysis. Furthermore, when factor scores are estimated for each subject, they are often more reliable than scores on individual observed variables. Anderson1 developed the asymptotic theory for principal component analysis and proved many theorems about its uses.
Steps in principal component analysis include selecting and measuring a set of variables, preparing the correlation matrix (to perform principal component analysis), extracting a set of factors from the correlation matrix, determining the number of factors, (probably) rotating the factors to increase interpretability and, finally, interpreting the results. Although there are relevant statistical considerations at most of these steps, an important test of the analysis is its interpretability.
Over time, numerous variants of PCA have been proposed and applied in the literature2,3. Linting et al.4 applied nonlinear principal component analysis (NLPCA) as a generalization of traditional principal component analysis (PCA) that allows for the detection and characterization of low-dimensional nonlinear structure in multivariate datasets. Likewise, about a decade and a half ago, Kim and Park documented that kernel principal component analysis (kernel PCA) had recently been proposed as a nonlinear extension of PCA5,6. The basic idea was to first map the input space into a feature space via a nonlinear map and then compute the principal components in that feature space.
Pierpaolo and Paolo reported that Principal Component Analysis (PCA) was a well-known procedure for summarizing large amounts of quantitative information by means of a small number of latent variables, which are known as factors7. In their research, they proposed an extended version of PCA that is applicable to interval data sets. The proposed method was named Midpoint Radius Principal Component Analysis (MR-PCA). Given an interval valued data set, the method uses both the midpoints (also known as the centers of the data) and the measure of interval width (the radii). To show how MR-PCA works, the results of a simulation study and two applications to chemical data were presented.
The book "Principal Component Analysis: Multidisciplinary Applications" by Sanguansat8 presented applications of PCA in various fields such as taxonomy, biology, pharmacy, finance, agriculture, ecology, health and architecture. Similarly, PCA was applied to multispectral-multimodal optical image analysis for malaria diagnostics9, whereas Nandi et al. applied PCA in medical image processing10.
2. Methodology
The data used for this study were originally collected by Dr. Zaffar Ali Pirzada, Professor, Chandka Medical College Larkana, Pakistan, for his doctoral study program at University of Sindh Jamshoro, Pakistan. After successful completion of his PhD, Dr. Pirzada provided the data set to the Department of Statistics, Sindh Agriculture University Tandojam, Pakistan, and permitted the application of advanced statistical techniques to explore the factors increasing the risk of Ischemic Heart Diseases (IHD). In the present study, variables from chemical tests of blood are examined to determine the relationship between them and the group membership variable.
2.1 Principal Component Analysis (PCA)
Principal component analysis was performed in R-mode using correlation and covariance for medical data11. Principal component analysis is based on the following linear model:
$$Y_{ij} = \beta_{i1} X_{1j} + \beta_{i2} X_{2j} + \cdots + \beta_{ip} X_{pj} \tag{1}$$
where i, j = 1, 2, ..., p.
In a correlation matrix based on complete data, the sizes of some correlations place constraints on the sizes of others. In particular,
$$r_{13} r_{23} - \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)} \le r_{12} \le r_{13} r_{23} + \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)} \tag{2}$$
High bivariate correlations suggest that the correlation matrix contains factors. Because a correlation may exist between only two variables, it is useful to look at matrices of partial correlations, in which pairwise relationships are adjusted for the effects of all other variables.
2.2 Bartlett's Test
Snedecor and Cochran12 defined Bartlett's test of sphericity as a notoriously sensitive test of the hypothesis that the correlations in a correlation matrix are zero. Because of its sensitivity and its dependence on N, the test is likely to be significant with samples of substantial size even if correlations are
2 Vol 10 (20) | May 2017 | www.indjst.org Indian Journal of Science and Technology
Naeem Ahmed Qureshi, Velo Suthar, Habibullah Magsi, Muhammad Javed Sheikh, Mubeena Pathan and Barkatullah Qureshi
very low. Therefore, use of the test is recommended only if there are fewer than five cases per variable.
2.3 Anti-Image Correlation
The anti-image correlation matrix contains the negatives of the partial correlations between pairs of variables with the effects of the other variables removed. If R is factorable, there are mostly small values among the off-diagonal elements of the anti-image matrix.
2.4 Extraction Methods
An important theorem from matrix algebra, found in most linear algebra texts, indicates that, under certain conditions, matrices can be diagonalized. The diagonalization of a matrix leaves a matrix that has positive numbers on the main diagonal and zeros in all off-diagonal positions. The elements on the main diagonal represent the variances from the correlation matrix, and the diagonalization can be written as follows:
$$L = V'RV \tag{3}$$
The columns of the matrix V are known as eigenvectors, while the elements on the main diagonal of the matrix L are called eigenvalues. There is a one-to-one correspondence between eigenvectors and eigenvalues, i.e., each eigenvector corresponds to one eigenvalue. The objective of factor analysis is to summarize how the variables are correlated with each other by keeping in the final solution only those factors that have large eigenvalues (usually factors with an eigenvalue greater than 1 are retained and the remaining ones are discarded), as each eigenvalue corresponds to a different factor.
When the matrix of eigenvectors is premultiplied by its transpose, the result is the identity matrix (a matrix whose main diagonal elements are 1 and all off-diagonal elements are 0). Hence pre- and post-multiplying the correlation matrix by the eigenvectors does not so much change it as transform it:
$$V'V = I \tag{4}$$
Because correlation matrices can be diagonalized in this way, the linear algebra of eigenvectors and eigenvalues can be used efficiently in factor analysis. In the end, when a correlation matrix is diagonalized, the information it contains is repackaged13.
Calculating the eigenvalues and eigenvectors is laborious, since it involves solving p equations in p unknowns subject to additional constraints, which is not easy to do manually. Once the eigenvalues and eigenvectors have been calculated, the rest of principal component analysis more or less 'falls out', as equations (5) to (8) show.
Equation (3) can be rearranged as follows:
$$R = VLV' \tag{5}$$
The correlation matrix is thus the product of three matrices: the matrix of eigenvectors, the matrix of eigenvalues and the transpose of the matrix of eigenvectors. Taking the square root of the eigenvalue matrix reorganizes this product as
$$R = V\sqrt{L}\,\sqrt{L}\,V' \tag{6}$$
or
$$R = \left(V\sqrt{L}\right)\left(\sqrt{L}\,V'\right)$$
If $V\sqrt{L}$ is called $A$ and $\sqrt{L}\,V'$ is $A'$, then
$$R = AA' \tag{7}$$
Put simply, the correlation matrix is formed by the product of the matrix of eigenvectors and the square root of the matrix of eigenvalues, as shown in equation (7). This is the basic equation of factor analysis: the correlation matrix is the product of the factor loading matrix, A, and its transpose.
It can be seen from equations (6) and (7) that the major task in Factor Analysis (FA) or Principal Component Analysis (PCA) is finding the eigenvalues and the eigenvectors. Once they have been computed, the (unrotated) factor loading matrix follows from the multiplication rule:
$$A = V\sqrt{L} \tag{8}$$
The factor loading matrix expresses the correlations between factors and variables. Every column
Application of Principal Component Analysis (PCA) to Medical Data
represents the correlations between the corresponding factor and the variables, i.e., the first column shows the correlations between the first factor and each variable, the second column shows the correlations between the second factor and each variable, and so on.
were suffering from ischemic heart disease) and control group. The following table shows the descriptive statistics of ten chemical tests of blood for ischemic heart disease patients and the control group.
Table 2. Correlation matrix of various chemical tests increasing the risk of IHD
Columns follow the row order: Chol = cholesterol, HDL = high density lipoprotein, TG = triglycerides, ApoA1 = Apo protein A-1, ApoB = Apo protein B, LDL = low density lipoprotein, PL = phospholipids, TL = total lipid, Glu = glucose, UA = uric acid.
Variable                    Chol    HDL     TG  ApoA1   ApoB    LDL     PL     TL    Glu     UA
Cholesterol                1.000  -.196   .230  -.246   .275   .606   .199   .421  -.086   .057
High density lipoprotein   -.196  1.000  -.170   .186  -.167  -.241  -.015  -.190   .150  -.029
Triglycerides               .230  -.170  1.000  -.236   .154   .172  -.023   .271  -.036  -.119
Apo protein A-1            -.246   .186  -.236  1.000  -.218  -.235  -.001  -.086   .004  -.050
Apo protein B               .275  -.167   .154  -.218  1.000   .377  -.028   .212  -.103   .108
Low density lipoprotein     .606  -.241   .172  -.235   .377  1.000   .307   .413  -.096   .119
Phospholipids               .199  -.015  -.023  -.001  -.028   .307  1.000   .118  -.023  -.014
Total lipid                 .421  -.190   .271  -.086   .212   .413   .118  1.000  -.055  -.087
Glucose                    -.086   .150  -.036   .004  -.103  -.096  -.023  -.055  1.000  -.049
Uric acid                   .057  -.029  -.119  -.050   .108   .119  -.014  -.087  -.049  1.000
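As a minimal sketch of the extraction steps in equations (3) to (8), the correlation matrix of Table 2 can be diagonalized numerically. The code assumes NumPy; the function name `extract_loadings` and the variable names are illustrative and are not taken from the original analysis, which does not state the software used. The uric acid/cholesterol entry is entered as .057, by symmetry with the cholesterol row. No output values are claimed here; the printed eigenvalues and loadings are meant to be compared with Tables 5 and 6, not to stand in for them.

```python
import numpy as np

# Correlation matrix R from Table 2. Row/column order: cholesterol, HDL,
# triglycerides, Apo protein A-1, Apo protein B, LDL, phospholipids,
# total lipid, glucose, uric acid.
R = np.array([
    [1.000, -.196,  .230, -.246,  .275,  .606,  .199,  .421, -.086,  .057],
    [-.196, 1.000, -.170,  .186, -.167, -.241, -.015, -.190,  .150, -.029],
    [ .230, -.170, 1.000, -.236,  .154,  .172, -.023,  .271, -.036, -.119],
    [-.246,  .186, -.236, 1.000, -.218, -.235, -.001, -.086,  .004, -.050],
    [ .275, -.167,  .154, -.218, 1.000,  .377, -.028,  .212, -.103,  .108],
    [ .606, -.241,  .172, -.235,  .377, 1.000,  .307,  .413, -.096,  .119],
    [ .199, -.015, -.023, -.001, -.028,  .307, 1.000,  .118, -.023, -.014],
    [ .421, -.190,  .271, -.086,  .212,  .413,  .118, 1.000, -.055, -.087],
    [-.086,  .150, -.036,  .004, -.103, -.096, -.023, -.055, 1.000, -.049],
    [ .057, -.029, -.119, -.050,  .108,  .119, -.014, -.087, -.049, 1.000],
])


def extract_loadings(R):
    """Diagonalize R (eq. 3) and build unrotated loadings A = V sqrt(L) (eq. 8)."""
    eigvals, eigvecs = np.linalg.eigh(R)       # symmetric solver, ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to descending eigenvalues
    L = eigvals[order]
    V = eigvecs[:, order]
    # Scale column j of V by sqrt(L[j]); clip guards against tiny negative values.
    A = V * np.sqrt(np.clip(L, 0.0, None))
    return L, V, A


L, V, A = extract_loadings(R)
retained = L > 1.0                             # eigenvalue-greater-than-1 rule

print("eigenvalues:", np.round(L, 3))
print("percent of variance:", np.round(100 * L / L.sum(), 2))
print("components retained:", int(retained.sum()))
print("unrotated loadings of retained components:")
print(np.round(A[:, retained], 3))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, each eigenvalue divided by 10 gives the share of total variance for its component. The loadings printed here are unrotated, so they need not match the orthogonally rotated values reported in Table 4.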
Table 3. Anti-image correlation matrix of various chemical tests increasing the risk of IHD
Columns follow the row order: Chol = cholesterol, HDL = high density lipoprotein, TG = triglycerides, ApoA1 = Apo protein A-1, ApoB = Apo protein B, LDL = low density lipoprotein, PL = phospholipids, TL = total lipid, Glu = glucose, UA = uric acid.
Variable                     Chol     HDL      TG   ApoA1    ApoB     LDL      PL      TL     Glu      UA
Cholesterol                 .7595   .0111  -.0888   .1152  -.0259  -.4360  -.0425  -.2123   .0298  -.0238
High density lipoprotein    .0111   .8157   .0817  -.1107   .0375   .1067  -.0405   .0763  -.1284   .0130
Triglycerides              -.0888   .0817   .7346   .1792  -.0493   .0137   .0644  -.1747   .0081   .1312
Apo protein A-1             .1152  -.1117   .1792   .7454   .1118   .0793  -.0395  -.0922   .0506   .0296
Apo protein B              -.0259   .0375  -.0493   .1118   .7692  -.2550   .1456  -.0531   .0645  -.0732
Low density lipoprotein    -.4360   .1067   .0137   .0793  -.2550   .7002  -.2786  -.1978   .0136  -.1145
Phospholipids              -.0425  -.0405   .0644  -.0395   .1456  -.2786   .5752  -.0009   .0106   .0505
Total lipid                -.2123   .0763  -.1747  -.0922  -.0531  -.1979  -.0009   .7828  -.0046   .1279
Glucose                     .0298  -.1284   .0081   .0506   .0645   .0136   .0106  -.0046   .6881   .0356
Uric acid                  -.0238   .0130   .1312   .0296  -.0732  -.1145   .0505   .1279   .0356   .4974
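The structure of an anti-image correlation matrix can be sketched from the inverse of a correlation matrix. The example below is an illustration under stated assumptions, not the procedure used to produce Table 3: it assumes NumPy, uses a small hypothetical three-variable correlation matrix, and follows the common convention that off-diagonal entries are the negatives of the partial correlations while diagonal entries are the per-variable measure of sampling adequacy (MSA). The function name `anti_image_correlation` is hypothetical.

```python
import numpy as np


def anti_image_correlation(R):
    """Anti-image correlation matrix of a correlation matrix R.

    Off-diagonal: negative partial correlations (all other variables removed).
    Diagonal: measure of sampling adequacy (MSA) of each variable.
    """
    P = np.linalg.inv(R)
    d = np.sqrt(np.diag(P))
    partial = -P / np.outer(d, d)        # partial correlation of i and j given the rest
    np.fill_diagonal(partial, 0.0)

    r_off = R - np.diag(np.diag(R))      # simple correlations with the diagonal zeroed
    r2 = (r_off ** 2).sum(axis=1)        # sum of squared simple correlations per variable
    p2 = (partial ** 2).sum(axis=1)      # sum of squared partial correlations per variable
    msa = r2 / (r2 + p2)

    Q = -partial
    np.fill_diagonal(Q, msa)
    return Q


# Hypothetical 3-variable correlation matrix, for illustration only.
R_demo = np.array([[1.0, 0.6, 0.4],
                   [0.6, 1.0, 0.3],
                   [0.4, 0.3, 1.0]])
Q_demo = anti_image_correlation(R_demo)
print(np.round(Q_demo, 4))
```

Small off-diagonal values indicate that little correlation between a pair of variables remains once the other variables are accounted for, and a diagonal value below 0.5 flags a variable that may not fit well with the rest.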
mg/dl, while for those who belonged to the control group the average value was recorded as 181.80 mg/dl. Total lipid in ischemic heart disease patients was found to be 922.67 mg/dl, whereas in the control group it was 771.75 mg/dl. The average value of glucose in ischemic heart disease patients was 88.66 mg/dl, while in the control group it was 85.80 mg/dl. The average value of uric acid in ischemic heart disease patients was 6.39 mg/dl, while in the control group it was 5.17 mg/dl.
An examination of Table 1 shows that the average values of cholesterol, triglyceride, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid were higher in ischemic heart disease patients than in the control group, while high density lipoprotein and Apo protein A-1 were higher in the control group.
Although the control group otherwise has normal values, its raised average cholesterol value puts these people at considerable risk of developing ischemic heart disease.
3.2 Correlation Analysis
Correlation coefficients between the ten chemical tests for Ischemic Heart Disease are shown in Table 2. Cholesterol has a high positive correlation with low density lipoprotein (0.606) and is moderately correlated with total lipid (0.421). There is a negative correlation between high density lipoprotein and low density lipoprotein (-0.241). The observed correlation of triglyceride with total lipid (0.271) is weakly positive. Apo protein A-1 has a weak negative correlation (-0.235) and Apo protein B a weak positive correlation (0.377) with low density lipoprotein. There is a moderately positive correlation between low density lipoprotein and total lipid (0.413).
3.3 Anti Image Correlation Analysis
Table 3 shows that almost all the values off the main diagonal, in both the lower and the upper half, are small, i.e. close to zero, which indicates that the variables are relatively free from unexplained correlations. On the main diagonal of the anti-image correlation matrix, each value is the measure of sampling adequacy for the respective item. Values less than 0.5 may indicate that a variable does not fit with the structure of the other variables. In Table 3 every variable has a main-diagonal value above 0.5 except uric acid (0.4974), which falls marginally below the threshold, so the variables have broadly adequate sampling.
3.4 Initial and Extraction Communalities Analysis
In principal component analysis, the initial and extraction communalities of the variables are the variances accounted for by the factors. For variance and covariance analysis, the initial communalities always remain equal to 1.0. The extraction communalities are estimates of the proportion of variance in each variable accounted for by the factors in the solution. They are the sum of squared loadings for a variable across factors; in Table 4, the extraction communality for cholesterol is (0.770)^2 + (0.180)^2 + (0.0823)^2 + (0.0877)^2 = 0.639. That is, Factor 1 + Factor 2 + Factor 3 + Factor 4 accounted for 63.9% of the variance of cholesterol. The variances calculated for high density lipoprotein, triglyceride, Apo protein A-1, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid were 42.1%, 57.6%, 435.4%, 74.7%, 45.5%, 64.1%, 53.6%, 81.6% and 72.3% respectively.
As stated in the previous section, factors with small variances do not fit the factor solution and should be discarded from the analysis. Our empirical results in Table 4 show that high density lipoprotein and Apo protein B can be discarded from the analysis because they explain a very small proportion of the variance. Since the proportion of variance explained by a factor is calculated by dividing its sum of squared loadings by the number of variables used, for these two factors it was calculated as 99.1%, whereas the covariance for these factors was calculated as 100%.
3.5 Component Analysis
Table 5 shows the eigenvalues and total variance explained for our factor solution. Ten variables are used in the sample. Because the aim of factor analysis is to summarize the pattern of correlations with as few factors as possible, and because every eigenvalue corresponds to a different factor, usually only factors with large eigenvalues are retained. In a good factor analysis, these few factors describe nearly the whole correlation matrix. When no limit is placed on the number of factors, eigenvalues are computed for each of the ten possible components, as shown in Table 5.
It is clear from Table 6 that the first four components have eigenvalues over 1 and are large enough to be retained; their variances are 26.65%, 12.02%, 11.79% and 10.19% respectively. These four components describe 60.67% of the total variance.
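The variance figures quoted in Section 3.5 can be checked with simple arithmetic. For a correlation matrix of ten standardized variables the total variance is 10, so an eigenvalue equals its percentage of variance multiplied by 10/100. The sketch below uses only the four percentages quoted in the text; the eigenvalues it prints are back-calculated for illustration and are not copied from Table 5.

```python
# Percent of total variance quoted for the first four components.
percent_variance = [26.65, 12.02, 11.79, 10.19]
n_variables = 10  # total variance of a 10-variable correlation matrix

# Back-calculated eigenvalues: percent of variance * number of variables / 100.
eigenvalues = [p * n_variables / 100 for p in percent_variance]
cumulative = sum(percent_variance)

for k, (p, ev) in enumerate(zip(percent_variance, eigenvalues), start=1):
    print(f"component {k}: eigenvalue ~ {ev:.3f}, variance {p:.2f}%")
print(f"cumulative variance of the rounded percentages: {cumulative:.2f}%")
```

All four back-calculated eigenvalues exceed 1, in line with the retention rule, and the rounded percentages sum to 60.65%, which agrees with the reported 60.67% up to rounding of the individual figures.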
Table 4. Relationship among loadings, communalities, variance, co-variance of orthogonally rotated factors
Variables Initial Factor 1 Factor 2 Factor 3 Factor 4 Extraction
Cholesterol 1.000 .770 .180 0.0823 0.0877 0.639765
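The Extraction value in Table 4 is the sum of squared loadings across the four retained factors. A minimal check for the cholesterol row, using only the four loadings printed in the table (the variable names are illustrative):

```python
# Loadings of cholesterol on factors 1-4, as printed in Table 4.
cholesterol_loadings = [0.770, 0.180, 0.0823, 0.0877]

# Extraction communality = sum of squared loadings across the retained factors.
communality = sum(a ** 2 for a in cholesterol_loadings)
print(f"extraction communality for cholesterol: {communality:.6f}")
```

The result matches the 0.639765 listed in the Extraction column, which the text quotes as 0.639.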
3.6 Component Score Coefficient Analysis
The component loading matrix is a matrix of correlations between components and variables. The first column holds the correlations between the first component and each variable, the second column holds the correlations between the second component and each variable, and so on. Table 6 reveals that few strong correlations were observed between the components and their respective variables. A component is interpreted from the variables that are highly correlated with it, that is, the variables that have high loadings on it.
It is evident from Table 6 that the variables with high loadings on component 1, and therefore assumed to belong to it, are cholesterol (0.289), Apo protein B (0.205) and low density lipoprotein (0.302). Likewise, in component 2 the variables with high weights were Apo protein A-1 (0.302), phospholipids (0.566) and uric acid (0.259). Similarly, triglyceride (0.239) and total lipid (0.300) belonged to component 3. In the same way, in component 4 the variables with high scores were high density lipoprotein (0.330) and glucose (0.803).
4. Conclusion
Stepwise principal component analysis was used to explore the effects of the chemical tests that significantly affect patients suffering from Ischemic Heart Disease, namely cholesterol, high density lipoprotein, triglyceride, Apo protein A-1, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid. Regarding correlations between variables, cholesterol was highly correlated with low density lipoprotein and moderately correlated with total lipid. The anti-image correlations indicated that the variables were relatively free from unexplained correlation. The extraction communalities showed that high density lipoprotein and Apo protein B have small variances and do not fit the factor solution. Regarding the variability between the components, the first four components account for 60.67% of the total variability. These four components adequately describe the whole correlation matrix. Finally, cholesterol, Apo protein B and low density lipoprotein had high values and belong to component 1. In component 2, Apo protein A-1, phospholipids and uric acid had the high scores. The variables with high loadings in component 3 were triglyceride and total lipid. The variables with high weights that belonged to component 4 were high density lipoprotein and glucose. The important finding of this study is that the average cholesterol level, which is considered to be the main factor increasing the risk of Ischemic Heart Disease, is found to be higher than normal values even in the control group.
5. References
1. Anderson TW. Asymptotic theory for principal component analysis. Annals of Mathematical Statistics. 1963; 34:122-48.
2. Sharma UM. Hybrid Feature Based Face Verification and Recognition System Using Principal Component Analysis and Artificial Neural Network. Indian Journal of Science and Technology. 2015 Jan; 8(S1):1-6.
3. Sindhubargavi R, Sindhu MY, Saravanan R. Spectrum Sensing Using Energy Detection Technique for Cognitive Radio Networks Using PCA Technique. Indian Journal of Science and Technology. 2014 Apr; 7(S4):40-5.
4. Linting M, Meulman JJ, Groenen PJ, van der Kooij AJ. Nonlinear principal components analysis: introduction and application. Psychological Methods. 2007; 12(3):336-58. PMid:17784798.
5. Kim KI, Park SH. Kernel principal component analysis for texture classification. IEEE Signal Processing Letters. 2001; 8(2):39-41.
6. Jayabalan L, Ganapathi P. Anomaly based Malicious Traffic Identification using Kernel Extreme Machine Learning (KELM) Classifier and Kernel Principal Component Analysis (KPCA). Indian Journal of Science and Technology. 2016 Apr; 9(13):1-11.
7. D'Urso P, Paolo G. A least squares approach to Principal Component Analysis for interval valued data. Economics and Statistics Discussion Paper. 2002; p. 1-29.
8. Sanguansat P. Croatia: InTech Publishers: Principal Component Analysis: Multidisciplinary Applications. 2012.
9. Omucheni DL, Kaduki KA, Bulimo WD, Angeyo HK. Application of principal component analysis to multispectral-multimodal optical image analysis for malaria diagnostics. Malaria Journal. 2014; 13:1-11. PMid:25495235 PMCid:PMC4364660.
10. Nandi D, Ashour AS, Samanta S, Chakraborty S, Mohammad AMS, Dey N. Principal component analysis in medical image processing: a study. International Journal of Image Mining. 2015; 1(1):65-86.
11. Murthy HSN, Meenakshi M. Efficient Algorithm for Early Detection of Myocardial Ischemia using PCA based
Features. Indian Journal of Science and Technology. 2016 Oct; 9(39):1-13.
12. Snedecor GW, Cochran WG. USA: Iowa State University Press: Statistical Methods. 8th ed. 1989.
13. Murali M. Principal Component Analysis based Feature Vector Extraction. Indian Journal of Science and Technology. 2015 Dec; 8(35):1-4.