

Application of Principal Component Analysis (PCA) to Medical Data

Article  in  Indian Journal of Science and Technology · February 2017


DOI: 10.17485/ijst/2017/v10i20/91294



Indian Journal of Science and Technology, Vol 10(20), DOI: 10.17485/ijst/2017/v10i20/91294, May 2017
ISSN (Print): 0974-6846, ISSN (Online): 0974-5645

Application of Principal Component Analysis (PCA) to Medical Data

Naeem Ahmed Qureshi1, Velo Suthar1, Habibullah Magsi2, Muhammad Javed Sheikh3, Mubeena Pathan4 and Barkatullah Qureshi4

1 Department of Statistics, Sindh Agriculture University, Tandojam, Pakistan; qureshistat@gmail.com, vsutahar@yahoo.co.uk
2 Department of Agricultural Economics, Sindh Agriculture University, Tandojam, Pakistan; hmagsi@sau.edu.pk
3 Department of Rural Sociology, Sindh Agriculture University, Tandojam, Pakistan; ruralsociologyst@gmail.com
4 Information Technology Centre, Sindh Agriculture University, Tandojam, Pakistan; mubeena2009@gmail.com, qureshi5939@gmail.com

Abstract
Objectives: To apply Principal Component Analysis (PCA) to medical data to explore the factors thought to be very important in increasing the risk of Ischemic Heart Disease (IHD). Methods/Statistical Analysis: PCA was performed in R-mode using correlation and covariance for medical data. Variables pertaining to chemical tests of blood, namely cholesterol, high density lipoprotein, triglyceride, Apo protein A-1, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid, were examined to determine the relationship between them and group membership. Findings: The results indicated that cholesterol, triglyceride, Apo protein B, low density lipoprotein, phospholipids, total lipid and uric acid were higher in the IHD group than in the control group, whereas high density lipoprotein and Apo protein A-1 were lower in the IHD group and higher in the control group. Cholesterol was highly correlated with low density lipoprotein and moderately correlated with total lipid. Cholesterol, Apo protein B and low density lipoprotein belonged to component 1; Apo protein A-1, phospholipids and uric acid belonged to component 2; triglycerides and total lipid belonged to component 3; and high density lipoprotein and glucose belonged to component 4. The first four components explained 60.67 percent of the total variability. Improvements/Applications: The results showed that the average cholesterol level, which is considered the main risk factor of Ischemic Heart Disease, was higher than normal even in the control group.

Keywords: Application, Ischemic Heart Diseases, PCA

1. Introduction

Principal component analysis (PCA) is a statistical technique that is applicable when the researcher is dealing with a single set of variables and wants to discover which variables are important in forming coherent factors, in such a way that no correlation exists between the newly formed factors. Since PCA works in a highly correlated environment, it chooses a set of variables that are highly dependent on each other but, at the same time, quite uncorrelated with other subsets of variables; these variables are combined to form a factor. The basic idea is that these newly formed factors drive the underlying process due to which the variables in the data set are supposed to correlate with each other. The specific goals of component analysis are to summarize patterns of correlations among observed variables, to reduce a large number of observed variables to a smaller number of factors, and to provide an operational definition (a regression equation) for an underlying process by using observed variables. Since the number of factors is usually far smaller than the number


of observed variables, there is considerable parsimony in using factor analysis. Furthermore, when factor scores are estimated for each subject, they are often more reliable than scores on individual observed variables. Anderson1 developed the asymptotic theory for principal component analysis and proved many theorems about its use.

The steps in principal component analysis include selecting and measuring a set of variables, preparing the correlation matrix (to perform principal component analysis), extracting a set of factors from the correlation matrix, determining the number of factors, (probably) rotating the factors to increase interpretability, and finally, interpreting the results. Although there are relevant statistical considerations for most of these steps, an important test of the analysis is its interpretability.

Over time, numerous variants of PCA have been proposed and applied in the literature2,3. Linting et al.4 applied nonlinear principal component analysis (NLPCA) as a generalization of traditional PCA that allows for the detection and characterization of low-dimensional nonlinear structure in multivariate datasets. Likewise, nearly one and a half decades ago, Kim and Park documented that kernel principal component analysis had recently been proposed as a nonlinear extension of PCA5,6. The basic idea is to first map the input space into a feature space via a nonlinear map and then compute the principal components in that feature space.

Pierpaolo and Paolo reported that PCA is a well-known procedure for summarizing large amounts of quantitative information by means of a small number of latent variables, known as factors7. In their research, they proposed an extended version of PCA that is applicable to interval-valued data. The proposed method was named Midpoint Radius Principal Component Analysis (MR-PCA). Given an interval-valued data set, the method uses both the midpoints (also known as the centers of the data) and the interval widths (also known as the radii). To analyze how MR-PCA works, the results of a simulation study and two applications on chemical data were presented.

The book "Principal Component Analysis: Multidisciplinary Applications"8 presented applications of PCA in various fields such as taxonomy, biology, pharmacy, finance, agriculture, ecology, health and architecture. Similarly, PCA was applied to multispectral-multimodal optical image analysis for malaria diagnostics9, whereas Nandi et al. applied PCA in medical image processing10.

2. Methodology

The data used for this study were originally collected by Dr. Zaffar Ali Pirzada, Professor, Chandka Medical College Larkana, Pakistan, for his doctoral study program at University of Sindh Jamshoro, Pakistan. After successful completion of his PhD, Dr. Pirzada provided the data set to the Department of Statistics, Sindh Agriculture University Tandojam, Pakistan, and permitted the application of more sophisticated tests to explore the factors increasing the risk of Ischemic Heart Disease (IHD). In the present study, variables pertaining to chemical tests of blood are examined to determine the relationship between them and group membership.

2.1 Principal Component Analysis (PCA)

Principal component analysis was performed in R-mode using correlation and covariance for medical data11. Principal component analysis is based on the following linear model:

$$Y_{ij} = \beta_{i1} X_{1j} + \beta_{i2} X_{2j} + \cdots + \beta_{ip} X_{pj} \qquad (1)$$

where i, j = 1, 2, ..., p.

In a correlation matrix based on complete data, the sizes of some correlations place constraints on the sizes of others. In particular,

$$r_{13} r_{23} - \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)} \le r_{12} \le r_{13} r_{23} + \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)} \qquad (2)$$

High bivariate correlations suggest that the correlation matrix contains factors. However, because a correlation may exist between only two variables, it is useful to examine matrices of partial correlations, in which pairwise correlations are adjusted for the effects of all other variables.

2.2 Bartlett's Test

Snedecor and Cochran12 described Bartlett's test of sphericity as a notoriously sensitive test of the hypothesis that the correlations in a correlation matrix are zero. Because of its sensitivity and its dependence on N, the test is likely to be significant with samples of substantial size even if correlations are


very low. Therefore, use of the test is recommended only if there are fewer than five cases per variable.

2.3 Anti-Image Correlation

The anti-image correlation matrix contains the negatives of the partial correlations between pairs of variables with the effects of the other variables removed. If R is factorable, most of the off-diagonal elements of the anti-image matrix are small.

2.4 Extraction Methods

The diagonalization of matrices is covered in most texts on linear algebra. An important theorem from matrix algebra indicates that, under certain conditions, matrices can be diagonalized. Diagonalizing a matrix yields a matrix with positive numbers on the main diagonal and zeros everywhere else. The elements on the main diagonal represent the variances from the correlation matrix, which can be written as follows:

$$L = V' R V \qquad (3)$$

The columns of the matrix V are known as eigenvectors, while the elements on the main diagonal of the matrix L are called eigenvalues. There is a one-to-one correspondence between eigenvectors and eigenvalues, i.e., each eigenvector corresponds to exactly one eigenvalue. The objective of factor analysis is to summarize how variables are correlated with each other by keeping in the final solution only those factors that have large eigenvalues (usually factors with an eigenvalue greater than 1 are retained and the remaining ones are discarded), as each eigenvalue corresponds to a different factor.

When the matrix of eigenvectors is pre-multiplied by its transpose, the result is the identity matrix (a matrix whose main diagonal elements are 1 and all off-diagonal elements are 0). Hence, pre- and post-multiplying the correlation matrix by the eigenvector matrix does not change the information it contains but only rearranges it. This property can be written as:

$$V' V = I \qquad (4)$$

Because correlation matrices can be diagonalized in this straightforward way, the linear algebra of eigenvectors and eigenvalues can be used efficiently in factor analysis. In the end, when these correlation matrices are diagonalized, the information that they contain is repackaged13.

Calculating the eigenvalues and eigenvectors is laborious, since it involves solving p equations in p unknowns under additional constraints, which is not easy to do by hand. Once the eigenvalues and eigenvectors have been calculated, the rest of the principal component analysis follows almost directly, as equations (5) to (8) show.

Equation (3) can be rearranged as follows:

$$R = V L V' \qquad (5)$$

That is, the correlation matrix is the product of three matrices: the matrix of eigenvectors, the matrix of eigenvalues, and the transpose of the matrix of eigenvectors. Taking the square root of the eigenvalue matrix gives:

$$R = V \sqrt{L} \sqrt{L} V' \qquad (6)$$

or

$$R = (V \sqrt{L})(\sqrt{L} V')$$

If $V \sqrt{L}$ is called A and $\sqrt{L} V'$ is called A', then

$$R = A A' \qquad (7)$$

Put simply, the correlation matrix is the product of the matrix of eigenvectors scaled by the square roots of the eigenvalues and its transpose, as shown in equation (7). This equation is treated as the basic equation of factor analysis: the correlation matrix is the product of the factor loading matrix, A, and its transpose.

It can be seen from equations (6) and (7) that the major task in Factor Analysis (FA) or Principal Component Analysis (PCA) is finding the eigenvalues and the eigenvectors. Once they have been computed, the (unrotated) factor loading matrix is obtained by the following multiplication:

$$A = V \sqrt{L} \qquad (8)$$

The correlations between factors and variables are expressed in the loading matrix. Every column


represents the correlations between the corresponding factor and the variables, i.e., the first column shows the correlations between the first factor and each variable, the second column shows the correlations between the second factor and each variable, and so on.

2.5 Communalities, Variance and Covariance

The communality of a variable is the variance accounted for by the factors. It is the squared multiple correlation of the variable as predicted from the factors. Communality is the sum of squared loadings (SSL) for a variable across factors. The proportion of variance explained by a factor is calculated by dividing the SSL for that factor by the number of variables, particularly when the rotation of factors is orthogonal. Likewise, the proportion of covariance for a particular factor is obtained by dividing the SSL for that factor by the sum of communalities.

3. Results and Discussion

In this study we discuss the results of principal component analysis for ten chemical tests of the patients (who were suffering from ischemic heart disease) and the control group. Table 1 shows the descriptive statistics of the ten chemical tests of blood for the ischemic heart disease patients and the control group.

3.1 Descriptive Statistics

It is evident from Table 1 that the patients suffering from ischemic heart disease had an average cholesterol value of 236.07 mg/dl, whereas that of the control group was 206.69 mg/dl. The average high density lipoprotein of ischemic heart disease patients was 37.69 mg/dl, whereas the control group had an average value of 51.66 mg/dl. The average triglyceride level of ischemic heart disease patients was 158.79 mg/dl, while that of the control group was 129.43 mg/dl. Apo protein A-1 averaged 99.46 mg/dl in ischemic heart disease patients and 130.99 mg/dl in the control group. Ischemic heart disease patients had an average Apo protein B value of 142.42 mg/dl, while the control group averaged 105.14 mg/dl. On average, phospholipids in ischemic heart disease patients were noted to be 204.86
Table 1. Descriptive statistics of chemical tests increasing the risk of ischemic heart disease (Ischemic Heart Disease/Control subjects, in mg/dl)

                                                  Mean                              Std. Error of Mean
Variable                  Normal values (mg/dl)   Control    IHD        Total      IHD       Control   Total
Cholesterol               100-200                 206.6931   236.0723   231.000    2.41680   1.55897   1.43067
High Density Lipoprotein  M 35-55, F 45-65        51.6634    37.6921    40.1043    .99327    .35413    .40339
Triglycerides             50-150                  129.4257   158.7937   153.723    2.88905   1.51043   1.42084
Apo Protein A-1           120-150                 130.9901   99.4587    104.902    1.97712   .92246    .97005
Apo Protein B             0-120                   105.1386   142.4174   135.981    1.52673   1.04907   1.07791
Low Density Lipoprotein   0-150                   124.6733   171.7376   163.612    2.20787   1.46226   1.46594
Phospholipids             120-180                 181.8020   204.8595   200.878    2.89000   1.85911   1.65589
Total Lipid               500-1000                771.7525   922.6736   896.617    7.83452   6.96674   6.37156
Glucose                   R-120-180               85.8020    88.6591    88.1658    .95658    1.41844   1.18565
Uric Acid                 M 3.4-7.0, F 2.4-5.7    5.1713     6.3913     6.1807     .08796    .06195    .05673


Table 2. Correlation matrix of various chemical tests increasing the risk of IHD
(Columns: Chol = Cholesterol, HDL = High density lipoprotein, TG = Triglycerides, ApoA1 = Apo Protein A-1, ApoB = Apo Protein B, LDL = Low density lipoprotein, PL = Phospholipids, TL = Total lipid, Glu = Glucose, UA = Uric Acid)

Variable                      Chol     HDL      TG   ApoA1    ApoB     LDL      PL      TL     Glu      UA
Cholesterol                  1.000   -.196    .230   -.246    .275    .606    .199    .421   -.086    .057
High density lipoprotein     -.196   1.000   -.170    .186   -.167   -.241   -.015   -.190    .150   -.029
Triglycerides                 .230   -.170   1.000   -.236    .154    .172   -.023    .271   -.036   -.119
Apo Protein A-1              -.246    .186   -.236   1.000   -.218   -.235   -.001   -.086    .004   -.050
Apo Protein B                 .275   -.167    .154   -.218   1.000    .377   -.028    .212   -.103    .108
Low density lipoprotein       .606   -.241    .172   -.235    .377   1.000    .307    .413   -.096    .119
Phospholipids                 .199   -.015   -.023   -.001   -.028    .307   1.000    .118   -.023   -.014
Total lipid                   .421   -.190    .271   -.086    .212    .413    .118   1.000   -.055   -.087
Glucose                      -.086    .150   -.036    .004   -.103   -.096   -.023   -.055   1.000   -.049
Uric Acid                     .057   -.029   -.119   -.050    .108    .119   -.014   -.087   -.049   1.000
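
The constraint of equation (2) can be checked against any three entries of Table 2. The following is a minimal sketch and not part of the original analysis: the function name `r12_bounds` is hypothetical, and the three correlations are simply read from Table 2 (cholesterol, low density lipoprotein and total lipid).

```python
import math

def r12_bounds(r13, r23):
    """Return the (lower, upper) limits that equation (2) places on r12."""
    centre = r13 * r23
    half_width = math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
    return centre - half_width, centre + half_width

# Values read from Table 2: variable 1 = cholesterol,
# variable 2 = low density lipoprotein, variable 3 = total lipid.
r12 = 0.606  # cholesterol vs. low density lipoprotein
r13 = 0.421  # cholesterol vs. total lipid
r23 = 0.413  # low density lipoprotein vs. total lipid

lower, upper = r12_bounds(r13, r23)
consistent = lower <= r12 <= upper
print(f"{lower:.3f} <= {r12} <= {upper:.3f}: {consistent}")
```

A triple of correlations that violated these limits would indicate that the matrix could not have come from complete data on the same subjects.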

Table 3. Anti-image correlation matrix of various chemical tests increasing the risk of IHD
(Columns: Chol = Cholesterol, HDL = High density lipoprotein, TG = Triglycerides, ApoA1 = Apo Protein A-1, ApoB = Apo Protein B, LDL = Low density lipoprotein, PL = Phospholipids, TL = Total lipid, Glu = Glucose, UA = Uric Acid)

Variable                      Chol     HDL      TG   ApoA1    ApoB     LDL      PL      TL     Glu      UA
Cholesterol                  .7595   .0111  -.0888   .1152  -.0259  -.4360  -.0425  -.2123   .0298  -.0238
High density lipoprotein     .0111   .8157   .0817  -.1107   .0375   .1067  -.0405   .0763  -.1284   .0130
Triglycerides               -.0888   .0817   .7346   .1792  -.0493   .0137   .0644  -.1747   .0081   .1312
Apo Protein A-1              .1152  -.1117   .1792   .7454   .1118   .0793  -.0395  -.0922   .0506   .0296
Apo Protein B               -.0259   .0375  -.0493   .1118   .7692  -.2550   .1456  -.0531   .0645  -.0732
Low density lipoprotein     -.4360   .1067   .0137   .0793  -.2550   .7002  -.2786  -.1978   .0136  -.1145
Phospholipids               -.0425  -.0405   .0644  -.0395   .1456  -.2786   .5752  -.0009   .0106   .0505
Total lipid                 -.2123   .0763  -.1747  -.0922  -.0531  -.1979  -.0009   .7828  -.0046   .1279
Glucose                      .0298  -.1284   .0081   .0506   .0645   .0136   .0106  -.0046   .6881   .0356
Uric Acid                   -.0238   .0130   .1312   .0296  -.0732  -.1145   .0505   .1279   .0356   .4974
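
A matrix of the kind shown in Table 3 can be derived from a correlation matrix through its inverse. The following is a minimal sketch, assuming NumPy is available; the function name `anti_image_correlation` and the 3 x 3 example matrix are hypothetical and are not the study data. Off-diagonal entries are the negatives of the partial correlations, and the diagonal holds the measure of sampling adequacy (MSA) of each variable.

```python
import numpy as np

def anti_image_correlation(R):
    """Anti-image correlation matrix with MSA values on the diagonal."""
    R = np.asarray(R, dtype=float)
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(R_inv))
    # Partial correlation of i and j given all other variables.
    partial = -R_inv / np.outer(d, d)
    np.fill_diagonal(partial, 0.0)
    # Off-diagonal part of R (diagonal set to zero).
    off = R.copy()
    np.fill_diagonal(off, 0.0)
    r2 = (off ** 2).sum(axis=1)       # squared correlations per variable
    p2 = (partial ** 2).sum(axis=1)   # squared partial correlations per variable
    msa = r2 / (r2 + p2)
    Q = -partial                      # negatives of the partial correlations
    np.fill_diagonal(Q, msa)
    return Q

# Hypothetical 3 x 3 correlation matrix, for illustration only.
R_example = np.array([
    [1.0, 0.6, 0.4],
    [0.6, 1.0, 0.3],
    [0.4, 0.3, 1.0],
])
Q = anti_image_correlation(R_example)
print(np.round(Q, 4))
```

Diagonal values below 0.5 are the ones the text treats as a sign that a variable does not fit well with the others.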


mg/dl, while in the control group the average was 181.80 mg/dl. Total lipid in ischemic heart disease patients averaged 922.67 mg/dl, whereas in the control group it was 771.75 mg/dl. The average glucose value was 88.66 mg/dl in ischemic heart disease patients and 85.80 mg/dl in the control group. The average uric acid value was 6.39 mg/dl in ischemic heart disease patients and 5.17 mg/dl in the control group.

Table 1 shows that the average values of cholesterol, triglyceride, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid were higher in ischemic heart disease patients than in the control group, while high density lipoprotein and Apo protein A-1 were higher in the control group.

Although the control group's values are largely within the normal ranges, its average cholesterol (206.69 mg/dl) exceeds the normal range of 100-200 mg/dl, so these people are also at considerable risk of developing ischemic heart disease.

3.2 Correlation Analysis

Correlation coefficients between the ten chemical tests for Ischemic Heart Disease are shown in Table 2. Cholesterol has a high positive correlation with low density lipoprotein (0.606) and is moderately correlated with total lipid (0.421). There is a negative correlation between high density lipoprotein and low density lipoprotein (-0.241). The observed correlation of triglyceride with total lipid (0.271) is weak and positive. Apo protein A-1 has a weak negative correlation (-0.235) and Apo protein B a weak positive correlation (0.377) with low density lipoprotein. A moderate positive correlation (0.413) is observed between low density lipoprotein and total lipid.

3.3 Anti Image Correlation Analysis

Table 3 shows that almost all the off-diagonal values, below or above the main diagonal, are small, i.e. close to zero, which indicates that the variables are relatively free from unexplained correlations. On the main diagonal of the anti-image correlation matrix, each value is the measure of sampling adequacy for the respective variable. Values less than 0.5 may indicate that a variable does not fit well with the structure of the other variables. In Table 3 the diagonal values are above 0.5 for all variables except uric acid (.4974), which is only marginally below the threshold, so the variables have adequate samples.

3.4 Initial and Extraction Communalities Analysis

In principal component analysis, the initial and extraction communalities of a variable are the variance accounted for by the factors. The initial communalities are always equal to 1.0. The extraction communalities are the proportion of the variance of each variable accounted for by the retained factors. The extraction communality is the sum of squared loadings for a variable across factors; in Table 4, the extraction communality for cholesterol is (0.770)^2 + (0.180)^2 + (0.0823)^2 + (0.0877)^2 = 0.6398. That is, Factor 1 + Factor 2 + Factor 3 + Factor 4 together account for about 64.0% of the variance of cholesterol. The corresponding values in Table 4 for high density lipoprotein, triglyceride, Apo protein A-1, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid are 42.1%, 57.7%, 51.3%, 45.5%, 74.7%, 64.1%, 53.7%, 81.6% and 72.4% respectively.

Variables with small communalities do not fit well in the factor solution and are candidates for removal from the analysis. The results presented in Table 4 show that high density lipoprotein (0.421) and Apo protein B (0.455) have the smallest extraction communalities and could therefore be discarded. The proportion of variance explained by a factor is calculated by dividing its sum of squared loadings by the number of variables used; for these two, it was calculated as 99.1%, whereas the covariance was calculated as 100%.

3.5 Component Analysis

Table 5 shows the eigenvalues and total variance explained for our factor solution. Ten variables are used in the analysis. The aim of factor analysis is to summarize the pattern of correlations with as few factors as possible. Every eigenvalue corresponds to a different factor, and usually only factors with large eigenvalues are retained. In a good factor analysis, these few factors describe almost the whole correlation matrix. When no limit is placed on the number of factors, eigenvalues are computed for each of the ten possible components, as shown in Table 5.

It is clear from Table 5 that the first four components have eigenvalues over 1 and are large enough to be retained; they account for 26.66%, 12.02%, 11.79% and 10.20% of the variance respectively. Together, these four components describe 60.67% of the total variance.


Table 4. Relationship among loadings, communalities, variance, co-variance of orthogonally rotated factors

Variables                   Initial   Factor 1    Factor 2    Factor 3    Factor 4    Extraction
Cholesterol                 1.000     .770        .180        0.0823      0.0877      0.639765
High density lipoprotein    1.000     -.452       .262        0.187       0.337       0.421486
Triglycerides               1.000     .447        -.545       0.282       0.0154      0.576595
Apo Protein A-1             1.000     -.448       .363        0.197       -0.376      0.512658
Apo Protein B               1.000     .548        -.133       -0.350      0.122       0.455377
Low density lipoprotein     1.000     .804        .307        -0.0146     0.0798      0.747246
Phospholipids               1.000     .276        .681        0.315       -0.0459     0.641269
Total Lipid                 1.000     .633        -0.002477   0.354       -0.104      0.536827
Glucose                     1.000     -.198       -0.009493   0.328       0.818       0.816002
Uric Acid                   1.000     0.07920     .311        -0.752      0.235       0.723723
Sum of square of loadings             2.666779    1.202354    1.180417    1.021397    6.070948
Proportion of Variance                0.266678    0.120235    0.118042    0.10214     0.607095
Proportion of co-variance             0.439269    0.19805     0.194437    0.168243    1.000000
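
The last column and the three summary rows of Table 4 follow from the loadings by the rules given in Section 2.5. The following is a minimal sketch in plain Python that recomputes them from the printed loadings; the variable names are illustrative choices, not names used by the authors.

```python
# Loadings as printed in Table 4 (rows: variables, columns: Factor 1 to 4).
loadings = {
    "Cholesterol":              [0.770, 0.180, 0.0823, 0.0877],
    "High density lipoprotein": [-0.452, 0.262, 0.187, 0.337],
    "Triglycerides":            [0.447, -0.545, 0.282, 0.0154],
    "Apo Protein A-1":          [-0.448, 0.363, 0.197, -0.376],
    "Apo Protein B":            [0.548, -0.133, -0.350, 0.122],
    "Low density lipoprotein":  [0.804, 0.307, -0.0146, 0.0798],
    "Phospholipids":            [0.276, 0.681, 0.315, -0.0459],
    "Total Lipid":              [0.633, -0.002477, 0.354, -0.104],
    "Glucose":                  [-0.198, -0.009493, 0.328, 0.818],
    "Uric Acid":                [0.07920, 0.311, -0.752, 0.235],
}
n_vars = len(loadings)
n_factors = 4

# Communality: sum of squared loadings for one variable across factors.
communality = {name: sum(a * a for a in row) for name, row in loadings.items()}

# SSL per factor: sum of squared loadings down one column.
ssl = [sum(row[k] ** 2 for row in loadings.values()) for k in range(n_factors)]

total_communality = sum(communality.values())
prop_variance = [s / n_vars for s in ssl]               # SSL / number of variables
prop_covariance = [s / total_communality for s in ssl]  # SSL / sum of communalities

for name, h2 in communality.items():
    print(f"{name:26s} {h2:.6f}")
print("SSL:", [round(s, 6) for s in ssl])
print("Proportion of variance:", [round(p, 6) for p in prop_variance])
print("Proportion of covariance:", [round(p, 6) for p in prop_covariance])
```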

Table 5. Component variability percentage correlation analysis by using eigenvalues

             Initial Eigenvalues                          Extraction Sums of Squared Loadings
Component    Total    % of variance    Cumulative %      Total    % of variance    Cumulative %
1            2.666    26.658           26.658            2.666    26.658           26.658
2            1.202    12.022           38.680            1.202    12.022           38.680
3            1.179    11.794           50.474            1.179    11.794           50.474
4            1.020    10.196           60.670            1.020    10.196           60.670
5            .869     8.694            69.364
6            .791     7.906            77.270
7            .738     7.382            84.652
8            .649     6.494            91.146
9            .531     5.309            96.455
10           .355     3.545            100.000
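
The decomposition described by equations (3) to (8) can be traced numerically on the correlation matrix printed in Table 2. The following is a sketch under stated assumptions: it assumes NumPy is available, it uses the three-decimal values of Table 2 (with the uric acid and cholesterol entry read as .057 by symmetry), and it is not the software run used by the authors. The resulting eigenvalues and percentages can be compared with Table 5.

```python
import numpy as np

# Correlation matrix transcribed from Table 2. Order: cholesterol, HDL,
# triglycerides, Apo A-1, Apo B, LDL, phospholipids, total lipid, glucose, uric acid.
R = np.array([
    [1.000, -.196,  .230, -.246,  .275,  .606,  .199,  .421, -.086,  .057],
    [-.196, 1.000, -.170,  .186, -.167, -.241, -.015, -.190,  .150, -.029],
    [ .230, -.170, 1.000, -.236,  .154,  .172, -.023,  .271, -.036, -.119],
    [-.246,  .186, -.236, 1.000, -.218, -.235, -.001, -.086,  .004, -.050],
    [ .275, -.167,  .154, -.218, 1.000,  .377, -.028,  .212, -.103,  .108],
    [ .606, -.241,  .172, -.235,  .377, 1.000,  .307,  .413, -.096,  .119],
    [ .199, -.015, -.023, -.001, -.028,  .307, 1.000,  .118, -.023, -.014],
    [ .421, -.190,  .271, -.086,  .212,  .413,  .118, 1.000, -.055, -.087],
    [-.086,  .150, -.036,  .004, -.103, -.096, -.023, -.055, 1.000, -.049],
    [ .057, -.029, -.119, -.050,  .108,  .119, -.014, -.087, -.049, 1.000],
])

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
L = eigvals[order]                         # diagonal of L in equation (3)
V = eigvecs[:, order]                      # eigenvectors as columns

# Equation (8): unrotated loadings. Clipping guards against tiny negative
# eigenvalues that rounding of the printed correlations could introduce.
A = V * np.sqrt(np.clip(L, 0.0, None))

pct_variance = 100 * L / L.sum()           # "% of variance" column
cumulative = np.cumsum(pct_variance)       # "Cumulative %" column
n_retained = int((L > 1).sum())            # eigenvalue-greater-than-1 rule

print(np.round(L, 3))
print(np.round(cumulative, 3))
print("components retained:", n_retained)
```

The sign of each column of A is arbitrary, so loadings obtained this way may differ from Table 4 by a factor of -1 per component.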

Table 6. Analysis of component score coefficient matrix

                            Components
Variables                   1        2        3        4
Cholesterol                 .289     .149     .070     .086
High Density Lipoprotein    -.169    .218     .158     .330
Triglyceride                .168     -.453    .239     .015
Apo Protein A-1             -.168    .302     .167     -.369
Apo protein B               .205     -.111    -.296    .120
Low Density Lipoprotein     .302     .255     -.012    .078
Phospholipids               .103     .566     .267     -.045
Total Lipid                 .237     -.002    .300     -.102
Glucose                     -.074    -.008    .278     .803
Uric Acid                   .030     .259     -.638    .226
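
The entries of Table 6 appear to be consistent with dividing each loading in Table 4 by the sum of squared loadings of its factor, which is the usual relationship between loadings and score coefficients for unrotated principal components. The following is a minimal sketch that checks this for four entries only; it assumes the Table 4 loadings are unrotated, and the choice of entries is illustrative.

```python
# Sums of squared loadings per factor, as printed in Table 4.
ssl = [2.666779, 1.202354, 1.180417, 1.021397]

# (variable, factor index, loading from Table 4, coefficient printed in Table 6)
checks = [
    ("Cholesterol",   0,  0.770,  0.289),
    ("Phospholipids", 1,  0.681,  0.566),
    ("Uric Acid",     2, -0.752, -0.638),
    ("Glucose",       3,  0.818,  0.803),
]

gaps = []
for name, k, loading, printed in checks:
    coeff = loading / ssl[k]               # loading divided by its factor's SSL
    gaps.append(abs(coeff - printed))
    print(f"{name:14s} factor {k + 1}: loading/SSL = {coeff:.3f}, Table 6 = {printed:.3f}")

max_gap = max(gaps)
```

The remaining differences are small and are compatible with rounding of the printed loadings.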


3.6 Component Score Coefficient Analysis

The component loading matrix is a matrix of correlations between components and variables. The first column contains the correlations between the first component and each variable, the second column contains the correlations between the second component and each variable, and so on. Table 6 reveals that there were few strong correlations between components and their respective variables. A component is interpreted from the variables that are highly correlated with it, that is, the variables that have high loadings on it.

It is evident from Table 6 that the variables with high loadings on component 1, and therefore assumed to belong to it, are cholesterol (0.289), Apo protein B (0.205) and low density lipoprotein (0.302). Likewise, in component 2 the variables with high weights were Apo protein A-1 (0.302), phospholipids (0.566) and uric acid (0.259). Similarly, triglyceride (0.239) and total lipid (0.300) belonged to component 3. In the same way, in component 4 the variables with high scores were high density lipoprotein (0.330) and glucose (0.803).

4. Conclusion

Stepwise principal component analysis was used to explore the effects of the chemical tests that significantly affect patients suffering from Ischemic Heart Disease, namely cholesterol, high density lipoprotein, triglyceride, Apo protein A-1, Apo protein B, low density lipoprotein, phospholipids, total lipid, glucose and uric acid. Regarding the correlations between variables, cholesterol was highly correlated with low density lipoprotein and moderately correlated with total lipid. The anti-image correlations indicated that the variables were relatively free from unexplained correlation. The extraction communalities showed that high density lipoprotein and Apo protein B have small variances and do not fit well in the factor solution. Regarding the variability between the components, the first four components account for 60.67% of the total variability. These four components adequately describe the whole correlation matrix. Finally, cholesterol, Apo protein B and low density lipoprotein had high values and belong to component 1. In component 2, Apo protein A-1, phospholipids and uric acid had the high scores. The variables that had high loadings in component 3 were triglyceride and total lipid. The variables that had high weights and belonged to component 4 were high density lipoprotein and glucose. The important finding of this study is that the average cholesterol level, which is considered to be the main factor increasing the risk of Ischemic Heart Disease, is higher than normal values even in the control group.

5. References

1. Anderson TW. Asymptotic theory for principal component analysis. Annals of Mathematical Statistics. 1963; 34:122-48.
2. Sharma UM. Hybrid Feature Based Face Verification and Recognition System Using Principal Component Analysis and Artificial Neural Network. Indian Journal of Science and Technology. 2015 Jan; 8(S1):1-6.
3. Sindhubargavi R, Sindhu MY, Saravanan R. Spectrum Sensing Using Energy Detection Technique for Cognitive Radio Networks Using PCA Technique. Indian Journal of Science and Technology. 2014 Apr; 7(S4):40-5.
4. Linting M, Meulman JJ, Groenen PJ, van der Kooij AJ. Nonlinear principal components analysis: introduction and application. Psychological Methods. 2007; 12(3):336-58. PMid:17784798.
5. Kim KI, Park SH. Kernel principal component analysis for texture classification. IEEE Signal Processing Letters. 2001; 8(2):39-41.
6. Jayabalan L, Ganapathi P. Anomaly based Malicious Traffic Identification using Kernel Extreme Machine Learning (KELM) Classifier and Kernel Principal Component Analysis (KPCA). Indian Journal of Science and Technology. 2016 Apr; 9(13):1-11.
7. D'Urso P, Paolo G. A least squares approach to Principal Component Analysis for interval valued data. Economics and Statistics Discussion Paper. 2002; p. 1-29.
8. Sanguansat P. Principal Component Analysis: Multidisciplinary Applications. Croatia: InTech Publishers; 2012.
9. Omucheni DL, Kaduki KA, Bulimo WD, Angeyo HK. Application of principal component analysis to multispectral-multimodal optical image analysis for malaria diagnostics. Malaria Journal. 2014; 13:1-11. PMid:25495235 PMCid:PMC4364660.
10. Nandi D, Ashour AS, Samanta S, Chakraborty S, Mohammad AMS, Dey N. Principal component analysis in medical image processing: a study. International Journal of Image Mining. 2015; 1(1):65-86.
11. Murthy HSN, Meenakshi M. Efficient Algorithm for Early Detection of Myocardial Ischemia using PCA based Features. Indian Journal of Science and Technology. 2016 Oct; 9(39):1-13.
12. Snedecor GW, Cochran WG. Statistical Methods. 8th ed. USA: Iowa State University Press; 1989.
13. Murali M. Principal Component Analysis based Feature Vector Extraction. Indian Journal of Science and Technology. 2015 Dec; 8(35):1-4.
