
Published by ijcsis

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 6, 2010

Application of Fuzzy Composition Relation for DNA Sequence Classification

Amrita Priyam
Dept. of Computer Science and Engineering, Birla Institute of Technology, Ranchi, India. amrita.priyam@gmail.com

B. M. Karan+, G. Sahoo++
+ Dept. of Electrical and Electronics Engineering
++ Dept. of Information Technology
Birla Institute of Technology, Ranchi, India

Abstract: This paper presents a probabilistic approach for DNA sequence analysis. A DNA sequence consists of an arrangement of the four nucleotides A, C, T and G. There are various representation schemes for a DNA sequence. This paper uses a representation scheme in which the probability of a symbol depends only on the occurrence of the previous symbol. This type of model is defined by two parameters: a set of states Q, which emit symbols, and a set of transitions between the states. Each transition has an associated transition probability $a_{ij}$, which represents the conditional probability of going to state $j$ in the next step, given that the current state is $i$. Further, the paper combines the different types of classification classes using a fuzzy composition relation. Finally, a log-odds ratio is used to decide which class a given sequence belongs to.

Keywords: Transition Probability, Fuzzy Composition Relation, Log-Odds Ratio

I. INTRODUCTION

A DNA sequence is a succession of the letters A, C, T and G, and a sequence may be any combination of these letters. A physical or mathematical model of a system produces a sequence of symbols according to a certain probability associated with them. This is known as a stochastic process; that is, it is a mathematical model for a biological system governed by a set of probability measures. The occurrence of the letters can lead to the further study of genetic disorders. There are various representation schemes for a DNA sequence. This paper uses a representation scheme in which the probability of a symbol depends only on the occurrence of the previous symbol and not on any symbol before that. This type of model is defined by two parameters: a set of states Q, which emit symbols, and a set of transitions between the states. Each transition has an associated transition probability $a_{ij}$, which represents the conditional probability of going to state $j$ in the next step, given that the current state is $i$. Each class has a set of transition probabilities associated with it; a transition probability measures the likelihood of going from one state to another. We further group the similar classes and merge their respective transition probabilities using a fuzzy composition relation. Finally, a log-odds ratio is used to decide which class the given sequence belongs to.

II. DNA SEQUENCES

A DNA sequence is a succession of letters representing the primary structure of a real or hypothetical DNA molecule or strand, with the capacity to carry information as described by the central dogma of molecular biology. There are four nucleotide bases: adenine (A), cytosine (C), guanine (G) and thymine (T). DNA sequencing is the process of determining the exact order of the bases A, T, C and G in a piece of DNA [3]. In essence, the DNA is used as a template to generate a set of fragments that differ in length from each other by a single base. The fragments are then separated by size, and the bases at the end are identified, recreating the original sequence of the DNA [8][9]. The most commonly used method of sequencing DNA, the dideoxy or chain termination method, was developed by Fred Sanger in 1977 (for which he won his second Nobel Prize). The key to the method is the use of modified bases called dideoxy bases; when a piece of DNA is being replicated and a dideoxy base is incorporated into the new chain, it stops the replication reaction.

Most DNA sequencing is carried out using the chain termination method [4]. This involves the synthesis of new DNA strands on a single-stranded template and the random incorporation of chain-terminating nucleotide analogues. The chain termination method produces a set of DNA molecules differing in length by one nucleotide. The last base in each molecule can be identified by way of a unique label. Separating these DNA molecules according to size places them in the correct order to read off the sequence.

http://sites.google.com/site/ijcsis/ ISSN 1947-5500

III. A PROBABILISTIC APPROACH FOR SEQUENCE REPRESENTATION

A DNA sequence is essentially represented as a string of the four characters A, C, T and G, and looks something like ACCTGACCTTACG. These strings can also be represented in terms of probability measures, and using these measures a sequence can be depicted graphically as well. This graphical representation matches the hidden Markov model. A physical or mathematical model of a system produces a sequence of symbols according to a certain probability associated with them. This is known as a stochastic process [2]. There are different ways to use probabilities for depicting DNA sequences. The diagrammatic representation can be shown as follows:

FIG 1: [The states of A, C, G and T.]

For example, the transition probability from state G to state T is 0.08, i.e.,

$$P(x_i = T \mid x_{i-1} = G) = 0.08$$

In a given sequence $x$ of length $L$, the symbols $x_1, x_2, \ldots, x_L$ represent the nucleotides. The sequence starts at the first state $x_1$ and makes successive transitions to $x_2$, $x_3$ and so on, until $x_L$. By the Markov property [6], the probability of $x_L$ depends only on the value of the previous state $x_{L-1}$, not on the entire previous sequence. This characteristic is known as the Markov property [5] and can be written as:

$$P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} P(x_i \mid x_{i-1}) \tag{1}$$

In Equation (1) we need to specify $P(x_1)$, the probability of the starting state. For simplicity, we would like to model this as a transition too. This can be done by adding a begin state, denoted by $0$, so that the starting state becomes $x_0 = 0$. Now, writing $a_{x_{i-1} x_i}$ for the transition probability, we can rewrite (1) as

$$P(x) = \prod_{i=1}^{L} a_{x_{i-1} x_i} \tag{2}$$
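As a minimal sketch of Equation (2), the product of transition probabilities can be accumulated in one pass over the sequence. The function name `sequence_probability`, the dictionary layout, and every number in the table `a` are illustrative assumptions rather than values from the paper; only the G to T entry is chosen to echo the 0.08 quoted in the text.

```python
BEGIN = "0"  # the begin state x_0 introduced in the text


def sequence_probability(sequence, a):
    """Multiply a[x_{i-1}][x_i] for i = 1..L, starting from x_0 = BEGIN."""
    prob = 1.0
    prev = BEGIN
    for symbol in sequence:
        prob *= a[prev][symbol]
        prev = symbol
    return prob


# Hypothetical transition table (each row sums to 1); not taken from the paper.
a = {
    "0": {"A": 0.25, "C": 0.25, "T": 0.25, "G": 0.25},
    "A": {"A": 0.40, "C": 0.20, "T": 0.20, "G": 0.20},
    "C": {"A": 0.20, "C": 0.40, "T": 0.20, "G": 0.20},
    "T": {"A": 0.20, "C": 0.20, "T": 0.40, "G": 0.20},
    "G": {"A": 0.30, "C": 0.30, "T": 0.08, "G": 0.32},
}

p_acg = sequence_probability("ACG", a)  # a[0][A] * a[A][C] * a[C][G]
p_gt = sequence_probability("GT", a)    # a[0][G] * a[G][T]
print(p_acg, p_gt)
```

For long sequences the product underflows quickly, which is one practical reason to work with sums of logarithms instead.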

If there are $n$ classes, then we must calculate the probability of a sequence $x$ for every one of the classes. To overcome this drawback we use the fuzzy composition relation. That is, we divide the $n$ classes into different groups based on their similarities. So, if $m$ out of the $n$ classes are similar, then they are treated as one group and their individual transition probability tables are merged using the fuzzy composition relation. The remaining $(n - m)$ classes are similarly grouped. Let us say there are two classes $R_1$ and $R_2$; the fuzzy composition relation between $R_1$ and $R_2$ [6][7] can be written as follows:

$$R_1 \circ R_2 = \max\bigl(\min\bigl(R_1(x),\, R_2(y)\bigr)\bigr) \tag{3}$$
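Equation (3) is stated compactly. One common elementwise reading, assumed here rather than taken from the paper, is the max-min composition of two relations stored as square tables over the same four states. The index names $x$, $y$, $z$ and the numbers in the worked entry are introduced only for illustration.

```latex
(R_1 \circ R_2)(x, z) = \max_{y \in \{A, C, T, G\}} \min\bigl(R_1(x, y),\, R_2(y, z)\bigr)

% Worked entry for a two-state toy case with illustrative values
% R_1(x, \cdot) = (0.06,\ 0.05) and R_2(\cdot, z) = (0.04,\ 0.07):
(R_1 \circ R_2)(x, z) = \max\bigl(\min(0.06, 0.04),\ \min(0.05, 0.07)\bigr) = \max(0.04,\ 0.05) = 0.05
```

Because only minima and maxima are taken, every entry of the merged table is a value that already occurs in one of the input tables.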

[Figure panels: "Different class representation" and "Grouping of similar classes"]

Fig 2: Grouping of similar classes

A table is then constructed representing the entire $(n - m)$ similar classes. From this table we compute the probability that a sequence $x$ belongs to a given group using the following equation:

$$\log \frac{P(x \mid +)}{P(x \mid -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} \tag{4}$$

Here "+" denotes the transition probabilities for one of the classes, obtained using the fuzzy composition relation, and "-" denotes the same for another class [1]. If this ratio is greater than zero, then we can say that the sequence $x$ is from the first class; otherwise it is from the other one.

An Example:

Let us consider an example for applying this classification method. We have taken into consideration the Swine flu data [11]. The different categories of the Swine flu data are shown as $R_1$, $R_2$ and $R_3$. $R_1$, $R_2$ and $R_3$ show the transition probabilities of the Type 1, Type 2 and Type 3 varieties of Avian Flu.

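$$
R_1 =
\begin{array}{c|cccc}
  & A & C & T & G \\ \hline
A & 0.13 & 0.06 & 0.09 & 0.08 \\
C & 0.09 & 0.04 & 0.05 & 0.02 \\
T & 0.06 & 0.05 & 0.05 & 0.07 \\
G & 0.08 & 0.04 & 0.05 & 0.06
\end{array}
$$

$$
R_2 =
\begin{array}{c|cccc}
  & A & C & T & G \\ \hline
A & 0.13 & 0.06 & 0.09 & 0.08 \\
C & 0.08 & 0.04 & 0.04 & 0.02 \\
T & 0.06 & 0.05 & 0.06 & 0.07 \\
G & 0.08 & 0.04 & 0.04 & 0.06
\end{array}
$$

$$
R_3 =
\begin{array}{c|cccc}
  & A & C & T & G \\ \hline
A & 0.08 & 0.05 & 0.08 & 0.08 \\
C & 0.08 & 0.03 & 0.06 & 0.02 \\
T & 0.05 & 0.06 & 0.06 & 0.09 \\
G & 0.08 & 0.06 & 0.04 & 0.07
\end{array}
$$

Using the fuzzy composition relation on $R_1$ and $R_2$, and then composing the result with $R_3$, we get the final table for the Swine Flu class as:

$$
\begin{array}{c|cccc}
  & A & C & T & G \\ \hline
A & 0.08 & 0.06 & 0.08 & 0.09 \\
C & 0.08 & 0.06 & 0.08 & 0.09 \\
T & 0.07 & 0.06 & 0.07 & 0.07 \\
G & 0.08 & 0.06 & 0.08 & 0.08
\end{array}
$$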

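Similarly, we can repeat the same procedure for another class, Staphylococcus. $X_1$, $X_2$ and $X_3$ show the transition probabilities of the Type 1, Type 2 and Type 3 varieties of Staphylococcus.

$$
X_1 =
\begin{array}{c|cccc}
  & A & C & T & G \\ \hline
A & 0.08 & 0.05 & 0.07 & 0.06 \\
C & 0.06 & 0.05 & 0.06 & 0.08 \\
T & 0.04 & 0.06 & 0.08 & 0.07 \\
G & 0.08 & 0.09 & 0.03 & 0.05
\end{array}
$$

$$
X_2 =
\begin{array}{c|cccc}
  & A & C & T & G \\ \hline
A & 0.09 & 0.06 & 0.06 & 0.06 \\
C & 0.07 & 0.05 & 0.06 & 0.07 \\
T & 0.04 & 0.05 & 0.07 & 0.07 \\
G & 0.07 & 0.09 & 0.04 & 0.05
\end{array}
$$

$$
X_3 =
\begin{array}{c|cccc}
  & A & C & T & G \\ \hline
A & 0.07 & 0.05 & 0.07 & 0.04 \\
C & 0.05 & 0.06 & 0.06 & 0.08 \\
T & 0.03 & 0.06 & 0.04 & 0.09 \\
G & 0.06 & 0.09 & 0.06 & 0.07
\end{array}
$$

Applying the fuzzy composition relation to these tables, we get the final table as:

$$
\begin{array}{c|cccc}
  & A & C & T & G \\ \hline
A & 0.07 & 0.06 & 0.05 & 0.06 \\
C & 0.08 & 0.09 & 0.07 & 0.06 \\
T & 0.04 & 0.03 & 0.07 & 0.06 \\
G & 0.05 & 0.08 & 0.04 & 0.02
\end{array}
$$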

Suppose we are given a sequence $x$ which is to be classified as falling into one of the given classes, say $x = \text{CGCG}$. From the final fuzzy composition tables, the log-odds ratio of this sequence is:

$$\log\frac{0.09}{0.08} + \log\frac{0.06}{0.07} + \log\frac{0.09}{0.08} = 0.032270$$

Now, since this ratio is greater than 0, we can conclude that the input sequence $x$ belongs to the class Avian Flu. If further classification of the data is needed, we then consult the individual transition probabilities for all three types.
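The scoring step can be sketched as a sum of log ratios over consecutive symbol pairs. The function name `log_odds`, the dictionary layout, and the choice of base-10 logarithms are assumptions. The `plus` and `minus` entries are copied from the three ratios printed above, so only the C to G and G to C transitions are filled in, and the begin-state term of Equation (4) is left out as in the printed expression.

```python
import math


def log_odds(sequence, plus, minus, base=10.0):
    """Sum log(a_plus / a_minus) over consecutive symbol pairs of `sequence`."""
    total = 0.0
    for prev, cur in zip(sequence, sequence[1:]):
        total += math.log(plus[(prev, cur)] / minus[(prev, cur)], base)
    return total


# Values taken from the printed log-odds expression for x = CGCG.
plus = {("C", "G"): 0.09, ("G", "C"): 0.06}   # "+" class
minus = {("C", "G"): 0.08, ("G", "C"): 0.07}  # "-" class

score = log_odds("CGCG", plus, minus)
label = "first class (+)" if score > 0 else "other class (-)"
print(round(score, 4), label)
```

With base-10 logarithms these ratios give about 0.035, close to but not identical to the printed 0.032270; the logarithm base and rounding behind the printed value are not stated. The sign, which is what the decision rule uses, is positive either way.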

CONCLUSION

In this paper we have used a probabilistic function for the Markov property. We have applied this for probabilistic determination in the case of the Avian flu virus and Staphylococcus. The paper also presented a way of identifying particular classes of genes or proteins. A given input sequence can belong to either of the given classes. By using a transition probability measure, one had to determine a value for each class even though they were similar. The paper presented a scheme in which the similar classes were merged by using the fuzzy composition relation, so that instead of calculating each individual probability measure, one measure
