
ICSP Proceedings

A Novel Adaptive Filtering Approach for Genomic Signal Processing
Baoshan Ma, Dongdong Qu, Yi-Sheng Zhu
College of Information Science and Technology, Dalian Maritime University, Dalian, Liaoning, 116026, China
mabaoshan@dlmu.edu.cn, ericqu@dlmu.edu.cn, yszhu@dlmu.edu.cn

Abstract-With the enormous amount of biological data available in the public domain, signal processing plays an important role in genomic and proteomic data processing. Digital filters have been applied to predict genes and proteins, but the filters need to be redesigned when the periodic behavior or characteristic frequency changes. In this paper, we propose a novel approach based on adaptive filtering theory which can identify genes or proteins in a unified framework. First, we review the popular Voss representation, which maps the alphabetic DNA sequence into digital series. Second, a novel adaptive filtering scheme for genomic signal processing that exploits the periodic behavior of biological sequences is proposed, which can analyze and predict the biological function regions of interest. Third, the adaptive filtering approach is applied to identify the exons in a DNA sequence according to the period-3 property of protein coding regions. The prediction curves of the exons are obtained with the Least Mean Square (LMS), the Recursive Least Squares (RLS) and the Kalman filtering algorithms. It is shown that the proposed method is useful for genomic signal processing.

Keywords-genomic signal processing; Voss mapping; period property; protein coding region; adaptive filter

I. INTRODUCTION
Genomic Signal Processing (GSP) is the engineering discipline that studies the processing of genomic signals. It encompasses various methodologies concerning expression profiles: detection, prediction, classification, control, and statistical and dynamical modeling of gene networks [1]. It can be a very useful tool for processing enormous genomic and proteomic data [2-6, 9]. These data contain deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and protein sequences. It is necessary to process DNA, RNA and protein sequences to identify special biological function segments such as exons, introns and hotspots. A DNA sequence is composed of four different nucleotides (or bases) named A, T, C and G. By mapping the alphabetic sequence of a DNA strand into a set of digital signals, techniques based on Digital Signal Processing (DSP) can be applied to analyze the DNA sequence.

At present, traditional and modern signal processing techniques play an important role in these fields. For instance, the Fourier transform has been applied to identify exons in genes. An optimization procedure has been used to improve the performance of traditional Fourier analysis in distinguishing coding from non-coding regions in a DNA sequence [2]. Digital filters have been applied for gene prediction. A multistage filter is designed to process the indicator sequences in order to reduce the background noise and obtain the prediction curve of coding areas [3]. Other DSP methods have been applied to identify protein coding regions [4, 5]. In [6], the hotspots in proteins are predicted according to the characteristic frequency of the protein sequence.

The paper is organized as follows. Section II reviews the Voss representation for DNA digital mapping and discusses the period behavior of biological sequences. In Section III, the novel adaptive filtering approach for genomic signal processing is proposed. In Section IV, the proposed method is applied to identify protein coding segments in a real DNA sequence. Finally, conclusions are presented in Section V.

II. THE VOSS REPRESENTATION AND THE PERIOD BEHAVIOR OF BIOLOGICAL SEQUENCES

A. The Voss Representation
Methods based on DSP have been applied to analyze and identify the genes in a DNA sequence. The first step is to map the alphabetic sequence into digital series. At present, one of the most popular mapping schemes is the Voss representation. A DNA sequence of length N (N a positive integer) is expressed by four different binary indicator sequences. When the nth base, $n = 1, 2, 3, \dots, N$, is $K$, $K \in \{A, T, C, G\}$, the indicator for the base K is equal to 1 at the nth location. The other indicators are equal to 0 at position n. For example, a single DNA strand is expressed by the alphabetic sequence s(n) as follows

...ATCCCAAGTATAAGA...

The binary indicator sequence $x_A(n)$ of the base A is given by $x_A(n) = \dots 100001100101101 \dots$, where 1 shows the presence of an A and 0 shows its absence. The indicator sequences for the other bases are defined similarly. This simple mapping of a DNA sequence is known as the Voss representation [7]. From the biological point of view, the Voss representation records the occurrence of each individual base K in a DNA sequence.

B. The Period Behavior of Biological Sequences
The periodicities are the main hidden oscillating patterns detected in genomic sequences [8]. The period-3 property is characteristic of the exons (protein coding segments) only.
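The Voss mapping of Section II.A can be illustrated in a few lines. The following is only a minimal sketch in Python, not the authors' code; the function name `voss_indicators` and the default base ordering are assumptions made for illustration. It is applied to the example strand quoted in Section II.A.

```python
def voss_indicators(sequence, alphabet="ATCG"):
    """Return one binary indicator sequence per base (Voss representation).

    The name and signature are illustrative assumptions, not from the paper.
    """
    sequence = sequence.upper()
    return {base: [1 if s == base else 0 for s in sequence] for base in alphabet}


# Example strand from Section II.A.
strand = "ATCCCAAGTATAAGA"
x = voss_indicators(strand)

# Indicator sequence of base A, printed as a string of 0s and 1s.
print("".join(str(v) for v in x["A"]))  # 100001100101101
```

At every position exactly one of the four indicators equals 1, so the four sequences together carry the same information as the original strand.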
___________________________________
978-1-4244-5900-1/10/$26.00 ©2010 IEEE

The source of the approximately 10.5-base sequence period is twofold. On the one hand, the sequences coding for alpha-helical coil regions in protein sequences have a hidden 3.5 amino acid repeat, which appears as a 10.5-base periodicity in the nucleotide sequences. On the other hand, the deformability of DNA, which is important for its folding in chromatin, is facilitated by periodical positioning of certain dinucleotides along the sequences, with a period close to 10.5 bases. Some sequence features are repeated nearly periodically at distances of approximately 400 bases. This is due to the general segmented organization of the genomes, which appear to have evolved by fusion of genome segments of nearly standard sizes, close to a typical 350 bases for eukaryotes and 440 bases for prokaryotes. Respective half-units (about 200 bases) are also frequently observed.

V. Veljković et al. have subjected the Electron-Ion Interaction Potential (EIIP) sequences of several proteins to Fourier spectral analysis [9]. They observed that the Discrete Fourier Transforms (DFTs) of the EIIP sequences of the proteins belonging to a particular functional group share a unique spectral component. The frequency of this spectral component characterizes the protein function; hence, it has been termed the characteristic frequency of the functional group. Thus, each protein function can be mapped onto a unique frequency in the frequency domain. Some proteins perform more than one function during their life cycle. For such proteins, each function corresponds to a different characteristic frequency.

[Plot: $|X(\omega)|^2$ versus unitary frequency $\omega$, with axis ticks at 0, 4/21, 2/3 and 1.]
Figure 1. Spectrum of sequence x(n) with unitary frequency

In the DSP field, the Fourier transform is useful for analyzing periodic signals such as the 3-, 10.5-, 200- and 400-base periodicities of genomic sequences or the characteristic frequencies of proteomic sequences. Let x(n) represent a discrete periodic signal; its Fourier transform X(k) is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, 2, \dots, N-1 \tag{1}$$

where N is the length of the series x(n). The total energy of the four indicator sequences of a DNA sequence is denoted by

$$S(\omega) = |X_A(\omega)|^2 + |X_T(\omega)|^2 + |X_C(\omega)|^2 + |X_G(\omega)|^2 \tag{2}$$

where $X_K(\omega)$ is the Fourier transform of the indicator sequence $x_K(n)$ of a DNA sequence, $K \in \{A, T, C, G\}$.

Suppose T represents the period of a signal and f its frequency, so that T = 1/f. Fig. 1 shows the power spectrum of the sequence x(n) against unitary frequency. The number of frequency samples is N/2, to satisfy the sampling theorem. Let $\omega$ be the unitary frequency. Hence $T = 2/\omega$, i.e. a component at unitary frequency $\omega$ has a period T equal to $2/\omega$. For example, the period-10.5 signal is located at $\omega = 2/10.5 = 4/21$, and the period-3 signal is located at $\omega = 2/3$ in the spectrum.

Figure 2. The spectrum of the periodical sequences: (a) the synthetic period-3 signal, (b) the real period-3 signal (the second exon of the nucleotide sequence F56F11.4a)

We calculate the Fourier transform of synthetic data and of a real genomic sequence. A synthetic periodic signal x(n) is generated by repeating the numbers -1, 0, +1 twenty times. Fig. 2(a) shows the power spectrum of x(n). Fig. 2(b) is the power spectrum of the real nucleotide sequence F56F11.4a in C. elegans chromosome III (bases 7949-14625, accession number AF099922). In both figures, the peaks are clear at unitary frequency 2/3; hence, in the next step we predict the biological function segments of interest by considering the period property of biological sequences.

III. THE ADAPTIVE FILTERING APPROACH FOR GENOMIC SIGNAL PROCESSING

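[Block diagram: the biological sequence s(n) passes through the Voss representation to give the indicator sequences xK(n), which feed the adaptive filter; the filter output yK(n) is subtracted from the desired period signal d(n) to form the error e(n).]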

Figure 3. Principle chart of adaptive filtering method for predicting biological function segments

The adaptive filtering method for predicting biological function segments is shown in Fig. 3. Let the input signal of the filter be denoted by x(n), the output by y(n), the desired response by d(n) and the error by e(n). We choose the desired signal d(n) according to the period behavior of the biological segments of interest, such as period-3 behavior. The error e(n) between the output y(n) and the desired response d(n) is used to adjust the weight vector of the filter.

Our approach is as follows. First, we map the symbolic DNA sequence s(n) into a set of digital signals using the Voss representation. The four indicator sequences are used as the input signals $x_K(n)$ of the adaptive filter, $K \in \{A, G, T, C\}$, and the output signals $y_K(n)$ are obtained. The four output signals all contribute to the prediction, so we define the sum output Y(n) as

$$Y(n) = |y_A(n)|^2 + |y_G(n)|^2 + |y_T(n)|^2 + |y_C(n)|^2 \tag{3}$$

IV. EXPERIMENTS
It has been noticed that the period-3 property exists within the exons (coding regions inside the genes) of eukaryotes (cells with a nucleus) and does not exist within the introns (non-coding regions in the genes), because of coding biases in the translation of codons into amino acids [11]. In this paper, the period-3 property is applied to find the protein coding segments (exons) in a DNA sequence. According to the period-3 property of protein coding regions, the desired signal is generated by the sinusoidal function $\sin(fk)$ with frequency $f = 2\pi/3$, $k = 0, 1, 2, 3, \dots$, which is a desired period-3 signal.

In the experiment, the data are the human β-globin sequence of 2000 bp (bases 62001-64000, accession number U01317.1), downloaded from the U.S. National Center for Biotechnology Information (www.ncbi.nlm.nih.gov).

TABLE I. HUMAN β-GLOBIN EXONS PREDICTION STRUCTURE

Exon #   Relative to Itself    Relative to β-globin
         Start    End          Start    End
1        187      278          62187    62278
2        409      631          62409    62631
3        1482     1610         63482    63610

The LMS, RLS and Kalman filter algorithms are used to update the adaptive filter [10]. We choose the following initial parameters: the weight matrix $w(0) = 0$, the step size $\mu = 0.001$ and the length of the weights $L = 256$ in the LMS algorithm; the weight matrix $w(0) = 0$, the forgetting factor $\lambda = 0.995$ and the length of the weights $L = 32$ in the RLS algorithm; the weight matrix $w(0) = 0$, the covariance matrices of the system and measurement noise $Q = 0.05$ and $R = 0.001$, and the length of the weights $L = 32$ in the Kalman filter.

Simulation results show that the period-3 signal passes the adaptive filters while the non-period-3 signals are eliminated; thus, there are obvious peaks at the locations of biological function segments with the period-3 property and no peaks in other regions. Therefore, the proposed approach can be applied to identify the biological function regions of interest in real biological sequences with the period property.
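The processing chain described in Sections III and IV (Voss mapping, adaptive filtering against a period-3 desired signal, and the sum output of equation (3)) can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: it uses the textbook LMS update in the style of [10], a tap-delay-line FIR structure, and hypothetical function names (`voss_indicators`, `desired_period3`, `lms_filter`, `exon_prediction_curve`). The defaults follow the LMS parameters quoted above ($\mu = 0.001$, $L = 256$); the toy input is the example strand of Section II.A repeated, not the β-globin data.

```python
import math


def voss_indicators(sequence, alphabet="AGTC"):
    """Binary indicator sequence x_K(n) for each base K (Voss representation)."""
    sequence = sequence.upper()
    return {k: [1.0 if s == k else 0.0 for s in sequence] for k in alphabet}


def desired_period3(length):
    """Desired signal d(k) = sin(f*k) with f = 2*pi/3, as in Section IV."""
    f = 2.0 * math.pi / 3.0
    return [math.sin(f * k) for k in range(length)]


def lms_filter(x, d, mu=0.001, num_taps=256):
    """Textbook LMS (assumed form): y = w.u, e = d - y, w <- w + mu*e*u."""
    w = [0.0] * num_taps  # initial weights w(0) = 0
    u = [0.0] * num_taps  # tap-delay line, newest sample first
    y = []
    for xn, dn in zip(x, d):
        u = [xn] + u[:-1]                               # shift in new sample
        yn = sum(wi * ui for wi, ui in zip(w, u))       # filter output y(n)
        e = dn - yn                                     # error e(n)
        w = [wi + mu * e * ui for wi, ui in zip(w, u)]  # weight update
        y.append(yn)
    return y


def exon_prediction_curve(sequence, mu=0.001, num_taps=256):
    """Sum output Y(n) of equation (3): squared outputs summed over the bases."""
    d = desired_period3(len(sequence))
    outputs = [lms_filter(x, d, mu, num_taps)
               for x in voss_indicators(sequence).values()]
    return [sum(y[n] ** 2 for y in outputs) for n in range(len(sequence))]


# Toy input only: the 15-base example strand repeated 20 times.
Y = exon_prediction_curve("ATCCCAAGTATAAGA" * 20, num_taps=32)
```

Whether peaks of Y(n) line up with exons depends on the data and on the chosen step size and filter length; the sketch only fixes the data flow. Swapping `lms_filter` for an RLS or Kalman update would leave the rest of the chain unchanged, which is the sense in which the framework is unified.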

Figure 4. The predictive curves of the exons by (a) LMS, (b) RLS, (c) Kalman filter

Figure 5. The predictive curves of the exons by the multistage filter

To compare the performance of the adaptive filtering algorithms, the output signals of the adaptive filters are plotted in Fig. 4. Fig. 4(a) is the output Y(n) of the LMS algorithm, (b) is the output of the RLS algorithm and (c) is the output of the Kalman filter. In Fig. 4(a), the last two exons can be seen and the background noise is removed, but the peak of the first exon is also eliminated. Compared with the real locations of the exons in Table I, there is some error for the second and third exons. Fig. 4(b) shows the predictive results of the RLS adaptive algorithm. The peaks of the three exons can be seen clearly and the background noise is largely removed. Compared with the locations of the exons in Table I, the predicted locations mostly lie within the real regions (the red line in (b)), which shows that the predicted locations agree well with the real positions. Fig. 4(c) shows the predictive results of the Kalman filtering algorithm. The peaks of the three exons can be seen distinctly and the predicted locations are accurate. Fig. 5 shows the predictive curve of the multistage filter in [3], in which the first exon is not clear.

The simulation indicates that the proposed adaptive filtering method is useful for gene prediction. There are obvious peaks at the locations of biological function segments with the period property and no peaks in other regions. For example, using a period-3 signal as the desired period signal in Fig. 3, the locations of exons can be obtained owing to the period-3 behavior of protein coding regions. The hot-spot locations in protein sequences can also be predicted when the corresponding characteristic frequency is used as the desired period signal.

V. CONCLUSION
First, the Voss representation is discussed. Second, according to the period property of biological sequences, a novel adaptive filtering approach is presented to predict the biological function segments. Finally, we illustrate this method by applying it to identify the protein coding regions in a real DNA sequence. The predictive location curve of the exons is obtained by simulation experiments. It is shown that the presented adaptive filtering approach is valid. We construct a unified framework of adaptive filtering for genomic and proteomic signal processing. The main advantage of this method is that it can easily be used to identify different biological function segments by changing the desired period signal, such as the period-3 behavior of DNA sequences or the characteristic frequency of protein sequences, whereas the filters proposed in [3] and [6] need to be redesigned if the period property or the characteristic frequency of the biological sequences changes.

Genomic and proteomic signal processing is a complex problem, and identification based on period-3 behavior is only a step towards gene prediction. Methods for improving the performance of the adaptive filtering algorithm will be investigated in future research.

REFERENCES
[1] I. Shmulevich and E. R. Dougherty, Genomic Signal Processing, Princeton University Press, 2007.
[2] D. Anastassiou, “Genomic signal processing,” IEEE Signal Processing Magazine, vol. 18, no. 4, 2001, pp. 8-20.
[3] P. P. Vaidyanathan, “Genomics and proteomics: a signal processor's tour,” IEEE Circuits and Systems Magazine, vol. 4, no. 4, 2004, pp. 6-29.
[4] E. Ambikairajah, J. Epps, and M. Akhtar, “Gene and exon prediction using time domain algorithms,” Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, vol. 1, 2005, pp. 199-202.
[5] S. W. A. Bergen and A. Antoniou, “Application of parametric window functions to the STDFT method for gene prediction,” 2005 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 2005, pp. 324-327.
[6] P. Ramachandran and A. Antoniou, “Identification of Hot-Spot Locations in Proteins Using Digital Filters,” IEEE Journal of Selected Topics in Signal Processing, vol. 2, no. 3, 2008, pp. 378-389.
[7] R. F. Voss, “Evolution of long-range fractal correlations and 1/f noise in DNA base sequences,” Phys. Rev. Lett., vol. 68, no. 25, 1992, pp. 3805-3808.
[8] E. N. Trifonov, “3-, 10.5-, 200- and 400-base periodicities in genome sequences,” Physica A, vol. 249, 1998, pp. 511-516.
[9] V. Veljković, I. Cosić, B. Dimitrijević, and D. Lalović, “Is it possible to analyze DNA and protein sequences by the methods of digital signal processing?,” IEEE Trans. Biomed. Eng., vol. BME-32, no. 5, 1985, pp. 337-341.
[10] S. Haykin, Adaptive Filter Theory, Fourth Edition, Prentice Hall, 2002.
[11] E. N. Trifonov and J. L. Sussman, “The pitch of chromatin DNA is reflected in its nucleotide sequence,” Proc. of the Nat. Acad. Sci., USA, 1980, pp. 3816-3820.
