(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 6, 2010
A Hybrid PSO-SVM Approach for Haplotype Tagging SNP Selection Problem

Min-Hui Lin
Department of Computer Science and Information Engineering, Dahan Institute of Technology, Sincheng, Hualien County, Taiwan, Republic of China

Chun-Liang Leu
Department of Information Technology, Ching Kuo Institute of Management and Health, Keelung, Taiwan, Republic of China
Abstract—Due to the large number of single nucleotide polymorphisms (SNPs), it is essential to use only a subset of all SNPs, called haplotype tagging SNPs (htSNPs), for finding the relationship between complex diseases and SNPs in biomedical research. In this paper, a PSO-SVM model that hybridizes particle swarm optimization (PSO) and the support vector machine (SVM) with feature selection and parameter optimization is proposed to appropriately select the htSNPs. Several public datasets of different sizes are considered to compare the proposed approach with other previously published methods. The computational results validate the effectiveness and performance of the proposed approach, and high prediction accuracy with fewer htSNPs can be obtained.

Keywords—Single Nucleotide Polymorphisms (SNPs), Haplotype Tagging SNPs (htSNPs), Particle Swarm Optimization (PSO), Support Vector Machine (SVM).
I. INTRODUCTION

The large number of single nucleotide polymorphisms (SNPs) in the human genome provides the essential tools for finding the association between sequence variation and complex diseases. A description of the SNPs in each chromosome is called a haplotype. The string element of each haplotype is 0 or 1, where 0 denotes the major allele and 1 denotes the minor allele. The genotype is the combined information of the two haplotypes on the homologous chromosomes, and it is prohibitively expensive to directly determine the haplotypes of an individual. Usually, the string element of a genotype is 0, 1, or 2, where 0 represents the major allele in a homozygous site, 1 represents the minor allele in a homozygous site, and 2 denotes a heterozygous site. The genotyping cost is affected by the number of SNPs typed. In order to reduce this cost, a small number of haplotype tagging SNPs (htSNPs) that predict the rest of the SNPs are needed.

The haplotype tagging SNP selection problem has become a very active research topic and is promising in disease association studies. Several computational algorithms have been proposed in the past few years, which can be divided into two categories: block-based and block-free methods. The block-based methods [1-2] first partition the human genome into haplotype blocks, within which haplotype diversity is limited, and then subsets of tagging SNPs are searched within each block. A main drawback of block-based methods is that there is no standard definition of a block and no consensus about how blocks should be partitioned. Methods that select a minimum informative set of SNPs without any reference to haplotype blocks are called block-free methods [3]. In the literature [4-5], feature selection techniques were adopted to solve the tagging SNP selection problem and achieved some promising results.

Feature selection algorithms may be broadly categorized into two groups: the filter approach and the wrapper approach. The filter approach selects highly ranked features based on a statistical score as a preprocessing step. Filters are relatively computationally cheap since they do not involve the induction algorithm. The wrapper approach, on the contrary, directly uses the induction algorithm to evaluate the feature subsets. It generally outperforms the filter method in terms of classification accuracy but is computationally more intensive. The Support Vector Machine (SVM) [6] is a useful technique for data classification. A practical difficulty of using SVM is the selection of parameters such as the penalty parameter C of the error term and the kernel parameter γ of the RBF kernel function; an appropriate choice of parameters yields better generalization performance.

In this paper, a hybrid PSO-SVM model that incorporates Particle Swarm Optimization (PSO) and the Support Vector Machine (SVM) with feature selection and parameter optimization is proposed to appropriately select the htSNPs. Several public benchmark datasets are considered to compare the proposed approach with other published methods. Experimental results validate the effectiveness of the proposed approach, and high prediction accuracy with fewer htSNPs can be obtained. The remainder of the paper is organized as follows: Section 2 introduces the problem formulation. Section 3 describes the PSO and the SVM classifier. In Section 4, the particle representation, fitness measurement, and the proposed hybrid system procedure are presented. Three public benchmark problems are used to validate the proposed approach, and the comparison results are described in Section 5. Finally, conclusions are made in Section 6.
II. PROBLEM FORMULATION

As shown in Figure 1, assume that dataset $D$ consists of $n$ haplotypes $\{h_i\}_{i=1}^{n}$, each with $p$ different SNPs $\{S_j\}_{j=1}^{p}$; $D$ is an $n \times p$ matrix. Each row in $D$ indicates the haplotype $h_i$ and each column in $D$ represents the SNP $S_j$. The element $d_{i,j}$ denotes the $j$-th SNP of the $i$-th haplotype, $d_{i,j} \in \{0, 1\}$. Our goal is to determine a minimum-size set of $g$ selected SNPs (htSNPs), $V = \{v_k\}$, $k \in \{1, 2, \ldots, p\}$, $|V| = g$, in which each random variable $v_k$ corresponds to the $k$-th SNP of the haplotypes in $D$, so as to predict the remaining unselected SNPs with a minimum prediction error. The size of $V$ is smaller than a user-defined value $R$ ($g \le R$), and the selected SNPs are called haplotype tagging SNPs (htSNPs) while the remaining unselected ones are named tagged SNPs. Thus, the selection of the htSNP set $V$ is based on how well it predicts the remaining set of unselected SNPs, and the number $g$ of selected SNPs is usually minimized according to the prediction error calculated by leave-one-out cross-validation (LOOCV) experiments [7].

$$D_{n \times p} = \begin{array}{c|cccccc}
 & S_1 & S_2 & \cdots & S_j & \cdots & S_p \\ \hline
h_1 & d_{1,1} & d_{1,2} & \cdots & d_{1,j} & \cdots & d_{1,p} \\
h_2 & d_{2,1} & d_{2,2} & \cdots & d_{2,j} & \cdots & d_{2,p} \\
\vdots & \vdots & \vdots & \ddots & \vdots & & \vdots \\
h_i & d_{i,1} & d_{i,2} & \cdots & d_{i,j} & \cdots & d_{i,p} \\
\vdots & \vdots & \vdots & & \vdots & \ddots & \vdots \\
h_n & d_{n,1} & d_{n,2} & \cdots & d_{n,j} & \cdots & d_{n,p}
\end{array}$$

Figure 1. The haplotype tagging SNP selection problem.
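To make the LOOCV criterion concrete, the sketch below scores a candidate htSNP set by training one classifier per tagged SNP and counting prediction mistakes on held-out haplotypes. It is an illustration only: the function name, the parameter defaults, and the use of scikit-learn's SVC as the classifier are assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.svm import SVC  # assumed classifier backend (see Section III-B)

def loocv_error(D, tag_idx, C=1.0, gamma=0.5):
    """LOOCV prediction error of a candidate htSNP set V (Section II).

    D       : (n, p) 0/1 haplotype matrix
    tag_idx : column indices of the selected htSNPs (the set V)
    Returns the fraction of wrongly predicted tagged-SNP alleles.
    """
    n, p = D.shape
    tagged = [j for j in range(p) if j not in set(tag_idx)]
    wrong = 0
    for i in range(n):                        # leave haplotype h_i out
        train = np.delete(np.arange(n), i)
        X_tr = D[np.ix_(train, tag_idx)]
        X_te = D[i, tag_idx].reshape(1, -1)
        for j in tagged:                      # predict each tagged SNP S_j
            y_tr = D[train, j]
            if np.unique(y_tr).size == 1:     # constant column: trivial prediction
                pred = y_tr[0]
            else:
                clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_tr)
                pred = clf.predict(X_te)[0]
            wrong += int(pred != D[i, j])
    return wrong / (n * len(tagged)) if tagged else 0.0
```

Minimizing this error over subsets $V$ with $|V| = g \le R$ is exactly the search problem the hybrid model of Section IV addresses.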
 
III. RELATED WORKS

A. Particle Swarm Optimization

PSO is an optimization method originally developed by Kennedy and Eberhart [8]. It models the processes of the sociological behavior associated with bird flocking and is one of the evolutionary computation techniques. In PSO, each solution is a "bird" in the flock and is referred to as a "particle"; a particle is analogous to a chromosome in a genetic algorithm. Each particle traverses the search space looking for the global optimum. The basic PSO update rules are as follows:

$$v_{id}^{k+1} = w \cdot v_{id}^{k} + c_1 r_1 (pb_{id} - x_{id}^{k}) + c_2 r_2 (gb_{id} - x_{id}^{k}) \qquad (1)$$

$$x_{id}^{k+1} = x_{id}^{k} + v_{id}^{k+1} \qquad (2)$$

where $d = 1, 2, \ldots, D$ and $i = 1, 2, \ldots, S$; $D$ is the dimension of the problem space, $S$ is the size of the population, and $k$ is the iteration count; $v_{id}$ is the $i$-th particle's velocity and $x_{id}$ is its current position; $pb_{id}$ is the $i$-th particle's best ($p_{best}$) solution achieved so far, and $gb_{id}$ is the global best ($g_{best}$) solution obtained so far by any particle in the population; $r_1$ and $r_2$ are random values in the range $[0, 1]$; $c_1$ and $c_2$ are learning factors, usually $c_1 = c_2 = 2$; and $w$ is an inertia factor. A large inertia weight facilitates global exploration, while a small one tends toward local exploration. In order to achieve a more refined solution, a general rule of thumb suggests setting the initial inertia value to the maximum $w_{max} = 0.9$ and gradually decreasing it to the minimum $w_{min} = 0.4$.

According to the searching behavior of PSO, the $g_{best}$ value is an important clue in leading particles to the global optimal solution, yet it is unavoidable that the solution falls into a local minimum while particles try to find better solutions. In order to let the exploration of an area produce more potential solutions, a mutation-like disturbance operation is inserted between Eq. (1) and Eq. (2). The disturbance operation randomly selects a number of dimensions (between 1 and the problem dimension) of $m$ particles ($1 \le m \le$ swarm size) and adds Gaussian noise to their moving vectors (velocities). The disturbance pushes particles toward unexpected directions in the selected dimensions rather than following previous experience; it lets a particle jump out of the local search and explore more unsearched area.

According to the velocity and position update formulas mentioned above, the basic process of the PSO algorithm is given as follows:

1.) Initialize the swarm by randomly generating initial particles.
2.) Evaluate the fitness of each particle in the population.
3.) Compare each particle's fitness value to identify the $p_{best}$ and $g_{best}$ values.
4.) Update the velocity of all particles using Equation (1).
5.) Add the disturbance operator to the moving vectors (velocities).
6.) Update the position of all particles using Equation (2).
7.) Repeat Step 2 to Step 6 until a termination criterion is satisfied (e.g., the number of iterations reaches the pre-defined maximum or a sufficiently good fitness value is obtained).

The authors [8] proposed a discrete binary version to allow the PSO algorithm to operate in discrete problem spaces. In the binary PSO (BPSO), the particle's personal best and global best are updated as in the continuous version. The major difference between the discrete PSO and the continuous version is that the velocities of the particles are instead defined in terms of the probability that a bit changes to one. Under this definition, a velocity must be restricted within the range $[V_{min}, V_{max}]$: if $v_{id}^{k+1} \notin (V_{min}, V_{max})$, then $v_{id}^{k+1} = \max(\min(V_{max}, v_{id}^{k+1}), V_{min})$. The new particle position is calculated using the following rule:
$$\text{If } rand() < S(v_{id}^{k+1}), \text{ then } x_{id}^{k+1} = 1; \text{ else } x_{id}^{k+1} = 0 \qquad (3)$$

where

$$S(v_{id}^{k+1}) = \frac{1}{1 + e^{-v_{id}^{k+1}}} \qquad (4)$$

The function $S(v_{id})$ is a sigmoid limiting transformation and $rand()$ is a random number selected from a uniform distribution on $[0, 1]$. Note that the BPSO is susceptible to sigmoid function saturation, which occurs when velocity values are either too large or too small. For a velocity of zero, the probability of the bit flipping is 50%.
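The following sketch combines the velocity update of Eq. (1), the Gaussian disturbance described above, velocity clamping, and the sigmoid bit-flip rule of Eqs. (3)-(4) into one BPSO iteration. It is a minimal illustration: the function name, the parameter values, and how many particles and dimensions are disturbed (`n_disturb`, `n_dims`) are assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def bpso_step(x, v, pb, gb, w=0.9, c1=2.0, c2=2.0,
              v_min=-4.0, v_max=4.0, n_disturb=2, n_dims=3):
    """One BPSO iteration: Eq. (1), disturbance, clamping, Eqs. (3)-(4).

    x, v, pb : (S, D) arrays; gb : (D,) array. Positions are 0/1 bits.
    Assumes n_disturb <= S and n_dims <= D.
    """
    S, D = x.shape
    r1 = rng.random((S, D))
    r2 = rng.random((S, D))
    v = w * v + c1 * r1 * (pb - x) + c2 * r2 * (gb - x)   # Eq. (1)

    # Mutation-like disturbance: Gaussian noise on a few velocity entries
    for i in rng.choice(S, size=n_disturb, replace=False):
        dims = rng.choice(D, size=n_dims, replace=False)
        v[i, dims] += rng.normal(0.0, 1.0, size=n_dims)

    v = np.clip(v, v_min, v_max)                          # keep v in [V_min, V_max]
    prob = 1.0 / (1.0 + np.exp(-v))                       # Eq. (4): sigmoid S(v)
    x_new = (rng.random((S, D)) < prob).astype(int)       # Eq. (3): bit-flip rule
    return x_new, v
```

In a swarm loop, `bpso_step` would be called once per iteration after evaluating fitness and updating `pb` and `gb` (Steps 2-6 above).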
B. Support Vector Machine Classifier

SVM starts from a linear classifier and searches for the optimal hyperplane with maximal margin. The main motivating criterion is to separate the various classes in the training set with a surface that maximizes the margin between them. It is an approximate implementation of the structural risk minimization induction principle, which aims to minimize a bound on the generalization error of a model.

Given a training set of instance-label pairs $(x_i, y_i)$, $i = 1, 2, \ldots, m$, where $x_i \in R^n$ and $y_i \in \{+1, -1\}$, the generalized linear SVM finds an optimal separating hyperplane $f(x) = \langle w \cdot x \rangle + b$ by solving the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2} \langle w \cdot w \rangle + C \sum_{i=1}^{m} \xi_i \quad \text{subject to } y_i (\langle w \cdot x_i \rangle + b) \ge 1 - \xi_i, \; \xi_i \ge 0 \qquad (5)$$

where $C$ is a penalty parameter on the training error and the $\xi_i$ are non-negative slack variables. This optimization model can be solved using the Lagrangian method, which maximizes the dual Lagrangian $L_D(\alpha)$ of Eq. (6), as in the separable case:

$$\max_{\alpha} \; L_D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i \cdot x_j \rangle \quad \text{subject to } 0 \le \alpha_i \le C, \; i = 1, 2, \ldots, m, \; \text{and } \sum_{i=1}^{m} \alpha_i y_i = 0 \qquad (6)$$

To solve for the optimal hyperplane, the dual Lagrangian $L_D(\alpha)$ must be maximized with respect to the non-negative $\alpha_i$ under the constraints $\sum_{i=1}^{m} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$. The penalty parameter $C$ is a constant to be chosen by the user; a larger value of $C$ corresponds to assigning a higher penalty to the errors. After the optimal solution $\alpha_i^*$ is obtained, the optimal hyperplane parameters $w^*$ and $b^*$ can be determined, and the optimal decision hyperplane $f(x, \alpha^*, b^*)$ can be written as:

$$f(x, \alpha^*, b^*) = \sum_{i=1}^{m} y_i \alpha_i^* \langle x_i \cdot x \rangle + b^* = \langle w^* \cdot x \rangle + b^* \qquad (7)$$
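As a quick numerical check of Eq. (7), the sketch below fits a linear SVM with scikit-learn (an assumed backend; the tiny dataset is made up for illustration) and reproduces the decision value from the stored products $y_i \alpha_i^*$:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set with labels in {+1, -1} (illustrative)
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 1.2]])
y = np.array([-1, 1, -1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Eq. (7): f(x) = sum_i y_i * alpha_i^* <x_i . x> + b^*.
# SVC stores the products y_i * alpha_i^* of the support vectors in
# dual_coef_ and the offset b^* in intercept_.
x_new = np.array([0.8, 0.9])
f_manual = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(np.isclose(f_manual, clf.decision_function([x_new])[0]))  # True
```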
The linear SVM can be generalized to a non-linear SVM via a mapping function $\Phi$, whose inner products define the kernel function; the training data can then be linearly separated in the mapped space by applying the linear SVM formulation. The inner product $\langle \Phi(x_i) \cdot \Phi(x_j) \rangle$ is calculated by the kernel function $k(x_i, x_j)$ for the given training data. By introducing the kernel function, the non-linear SVM (optimal hyperplane) has the following form:

$$f(x, \alpha^*, b^*) = \sum_{i=1}^{m} y_i \alpha_i^* \langle \Phi(x_i) \cdot \Phi(x) \rangle + b^* = \sum_{i=1}^{m} y_i \alpha_i^* k(x_i, x) + b^* \qquad (8)$$

Though new kernel functions are being proposed by researchers, there are four basic kernels:

Linear: $k(x_i, x_j) = \langle x_i \cdot x_j \rangle$ (9)

Polynomial: $k(x_i, x_j) = (\gamma x_i^T x_j + r)^d, \; \gamma > 0$ (10)

RBF: $k(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2), \; \gamma > 0$ (11)

Sigmoid: $k(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$ (12)

where $\gamma$, $r$, and $d$ are kernel parameters. The radial basis function (RBF) of Eq. (11) is a commonly used kernel. In order to improve classification accuracy, the kernel parameter $\gamma$ in the kernel function should be properly set.
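The sketch below computes the RBF Gram matrix of Eq. (11) directly and fits the same kernel through scikit-learn's SVC (an assumed backend); the data and the values of C and gamma are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 0/1 feature vectors with labels in {+1, -1} (illustrative data)
X = np.array([[0, 1, 0], [1, 1, 0], [0, 0, 1],
              [1, 0, 1], [1, 1, 1], [0, 0, 0]])
y = np.array([1, 1, -1, -1, 1, -1])

gamma = 0.5  # kernel parameter of Eq. (11); an arbitrary choice here

# RBF Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)

# The same kernel inside an SVM classifier (C is the penalty of Eq. (5))
clf = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y)
print(clf.predict([[1, 0, 0]]))
```

Because both C and gamma shift the decision boundary, the hybrid model of Section IV searches over them jointly with the SNP subset rather than fixing them by hand.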
 
IV. METHODS

For the htSNP selection problem described in Section 2, the following notation and definitions are used to present our proposed method. In the $n \times p$ dataset matrix $D$, each row (haplotype) can be viewed as a learning instance belonging to a class, and each column (SNP) is an attribute or feature based on which sequences can be classified. Given the values of the $g$ htSNPs of an unknown individual $x$ and the known full training samples from $D$, SNP prediction can be treated as a feature selection problem: select the tagging SNPs and use them to predict the non-selected (tagged) SNPs in $x$. Thus, tagging SNP selection can be transformed into a binary classification of vectors with $g$ coordinates, solved by the support vector machine classifier. Here, an effective PSO-SVM model that hybridizes particle swarm optimization and the support vector machine with feature selection and parameter optimization is proposed to appropriately select the htSNPs. The particle representation, fitness definition, disturbance strategy for the PSO operation, and the procedure of the proposed hybrid system are described as follows.

A. Particle Representation

The RBF kernel function is used in the SVM classifier to implement our proposed method. The RBF kernel requires that only two parameters, $C$ and $\gamma$, be set. When using the RBF kernel for SVM, the parameters $C$ and $\gamma$ and the SNPs viewed as input features must be optimized.
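The preview ends here, before the encoding details. Purely as an illustration of the kind of particle such a hybrid model needs (the layout and parameter ranges below are assumptions, not the paper's specification), one common choice is to concatenate a binary SNP-selection mask with real-valued entries for C and gamma:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_particle(p, c_range=(2.0**-5, 2.0**5), g_range=(2.0**-5, 2.0**5)):
    """A hypothetical particle for the hybrid search: a 0/1 mask over the
    p SNPs (1 = selected as htSNP) plus real-valued C and gamma for the
    RBF-kernel SVM. The layout and ranges are illustrative assumptions."""
    return {
        "mask": rng.integers(0, 2, size=p),   # SNP feature-selection bits
        "C": rng.uniform(*c_range),           # SVM penalty parameter C
        "gamma": rng.uniform(*g_range),       # RBF kernel parameter gamma
    }

particle = random_particle(p=20)
tag_idx = np.flatnonzero(particle["mask"])    # columns treated as htSNPs
# A fitness evaluation could now call a LOOCV routine such as the
# loocv_error sketch from Section II with tag_idx, C, and gamma.
```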