Vijay Krishnan, Masters Student, Computer Science Department

Repetitive DNA

Refers to substrings of the genome that repeat multiple times. Different instances of the repeat element can have slightly different patterns. Highly prevalent in eukaryotes (organisms with a visible nucleus and cell structure, as opposed to bacteria).

About 50% of the human genome is repetitive DNA.


**Why detect repetitive DNA?**

Repeats drive evolution in diverse ways (Kazazian, 2004). Repetitive DNA is generally not found to have any function. Homology searches need repeat masking to avoid an explosion of unnecessary results.

Repeats also contain information about parentage.


Hit

Defined as a local alignment between two regions Q and T. Q and T are called images of the hit. Q = partner(T) with respect to the hit. Endpoints of Q are referred to as start(Q) and end(Q). A hit is completely defined by the endpoint coordinates of Q and T.

Dispersed Families (DF)

Often comprise mobile elements like Transposons and Retrotransposons.

Signature induced by a Dispersed Family: Images(x) = {A1, A2}, Images(y) = {A1, A3}, Images(z) = {A2, A3}.

Tandem Arrays (TA)

The repeating element is called a "Satellite". "Pyramidal" signature induced by a Tandem Array.

Other Repeat Families

Pseudo-Satellites: Intermediate between Satellites and Dispersed Families.

Tandem Repeat: Often defined to be the same as TA. The PILER paper defines it as images with size 50-2000 bases, separated by 50 to 15000 bases.

De novo identification of repeat families

Input: The Genome sequence.

Output: The repeat families and the positions where they occur in the Genome.

PILER: identification and classification of genomic repeats

Robert C. Edgar and Eugene W. Myers

Finding Local Alignments (Hits)

Pairwise Alignment of Local Sequences (PALS) software used as a black box. Used to find local alignments of a minimum length and a minimum identity. Additional optimizations for banded search for alignments: finding regions separated by at most a maximum distance.

Pile

Suppose we are given a list of N hits. This corresponds to 2N images (intervals). A pile is a list of all images covering a maximal contiguous region. "Merge" overlapping images and "erase" the boundaries between adjacent images.

Let images = { [1,3], [3,4], [2,6], [8,9], [9,13] }.

Pile boundaries = { [1,6], [8,13] }.

Pile Images = { {[1,3], [3,4], [2,6]}, {[8,9], [9,13]} }

Construction of Piles (Example)

Images = { [1,3], [2,4], [6,7] }

| Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Value (initial) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Value (after [1,3]) | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Value (after [2,4]) | 1 | 2 | 2 | 1 | 0 | 0 | 0 |
| Value (after [6,7]) | 1 | 2 | 2 | 1 | 0 | 1 | 1 |
| Value (pile number) | 1 | 1 | 1 | 1 | 0 | 2 | 2 |
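The coverage-array idea in the example can be sketched in a few lines of Python. This is a minimal illustration, not PILER's actual implementation: the function names (`coverage_counts`, `label_piles`, `pile_boundaries`) and the inclusive, 1-based interval convention are assumptions made for the sketch.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end), 1-based (assumed convention)


def coverage_counts(images: List[Interval], genome_length: int) -> List[int]:
    """Count how many images cover each position; index 0 is unused padding."""
    counts = [0] * (genome_length + 1)
    for start, end in images:
        for pos in range(start, end + 1):
            counts[pos] += 1
    return counts


def label_piles(counts: List[int]) -> List[int]:
    """Give every maximal run of covered positions its own pile number."""
    labels = [0] * len(counts)
    pile_id = 0
    for pos in range(1, len(counts)):
        if counts[pos] > 0:
            if counts[pos - 1] == 0:  # a new covered run starts here
                pile_id += 1
            labels[pos] = pile_id
    return labels


def pile_boundaries(labels: List[int]) -> List[Interval]:
    """Recover the [start, end] boundary of each pile from the labels."""
    bounds: Dict[int, Interval] = {}
    for pos in range(1, len(labels)):
        pid = labels[pos]
        if pid:
            start, _ = bounds.get(pid, (pos, pos))
            bounds[pid] = (start, pos)
    return [bounds[pid] for pid in sorted(bounds)]
```

Calling `coverage_counts([(1, 3), (2, 4), (6, 7)], 7)` and then `label_piles` on the result walks through the same coverage and pile-number rows as the example. A per-base array is easy to follow but costs memory proportional to genome length; a sort-and-sweep over interval endpoints would be an alternative for large inputs.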

PILER-DF

is-global-image(Q) is true if: #bases in Q >= g * (#bases in pile(Q))

- Let G be a graph with one node for each pile, and no edges.
- For each pile p in P:
  - For each image Q in p:
    - Let T = partner(Q)
    - If is-global-image(Q) and is-global-image(T):
      - Add an edge between p and pile(T) to G
- Find connected components of G of order at least t.
- Each connected component is a DF.

t >= 3 to avoid segmental duplication.
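The PILER-DF steps can be sketched as follows. This is a hedged illustration only: the data layout (an image as a `(pile_id, start, end)` tuple, a hit as a pair of images, `pile_bounds` as a dictionary) and the function names are assumptions for the sketch, and the values of `g` and `t` are left to the caller.

```python
from typing import Dict, List, Set, Tuple

Image = Tuple[int, int, int]      # (pile_id, start, end), inclusive; assumed layout
Hit = Tuple[Image, Image]         # the two images of a hit (Q and its partner T)


def is_global_image(image: Image, pile_bounds: Dict[int, Tuple[int, int]], g: float) -> bool:
    """True if the image covers at least a fraction g of its pile."""
    pile_id, start, end = image
    p_start, p_end = pile_bounds[pile_id]
    return (end - start + 1) >= g * (p_end - p_start + 1)


def dispersed_families(hits: List[Hit], pile_bounds: Dict[int, Tuple[int, int]],
                       g: float, t: int) -> List[List[int]]:
    """Return connected components of the pile graph with at least t piles."""
    adj: Dict[int, Set[int]] = {pid: set() for pid in pile_bounds}
    for q, partner in hits:
        if is_global_image(q, pile_bounds, g) and is_global_image(partner, pile_bounds, g):
            if q[0] != partner[0]:  # a self-loop would not change the components
                adj[q[0]].add(partner[0])
                adj[partner[0]].add(q[0])

    families: List[List[int]] = []
    seen: Set[int] = set()
    for pid in sorted(adj):
        if pid in seen:
            continue
        component: Set[int] = set()
        stack = [pid]
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        if len(component) >= t:
            families.append(sorted(component))
    return families
```

Each hit is listed once here and adds one undirected edge, which is equivalent to visiting every image and looking up its partner as in the steps above.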

PILER-PS

Similar to the problem of finding DFs, except that PSs are typically closer to one another. Algorithm identical to PILER-DF except for banded search to identify hits.

Banded Search: Ensures that the PSs are clustered. Allows a faster and more sensitive search for hits.

PILER-TA

TAs have pyramids as signatures. We can avoid comparing every pair of hits since: hits in a pyramid belong to the same pile, and the images should be separated by at most a maximum distance (banded search).

Define first(h) = image in h with smaller start coordinate. Define last(h) = image in h with larger start coordinate.

PILER-TA

- For each pile p:
  - Create an empty graph G with all hits in the pile
  - For each pair of hits (h1, h2) in p:
    - Set shorter_length = min(|h1|, |h2|)
    - Set longer_length = max(|h1|, |h2|)
    - Set Q1 = first(h1) (here: B1, B2, B3)
    - Set T1 = last(h1) (here: B2, B3, B4)
    - Set Q2 = first(h2) (here: B1, B2)
    - Set T2 = last(h2) (here: B3, B4)
    - Set dS = (start(Q2) - start(Q1)) / shorter_length (here: 0 - 0 = 0)
    - Set dE = (end(T2) - end(T1)) / shorter_length (here: 4 - 4 = 0)
    - If shorter_length / longer_length > 0.5 and |dS| < m and |dE| < m:
      - Add edge between h1 and h2 to G
  - Each connected component of G is a TA.

0 <= m <= 1. By default m = 0.05.
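A minimal Python sketch of the pairwise test for one pile follows. Several details are assumptions made for illustration: a hit is a pair of `(start, end)` images, the hit length |h| is taken to be the length of its first image, and the helper names are hypothetical. The threshold 0.5 and the default m = 0.05 come from the slide.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Image = Tuple[int, int]        # (start, end) coordinates; assumed layout
Hit = Tuple[Image, Image]


def first(hit: Hit) -> Image:
    """Image with the smaller start coordinate."""
    return min(hit, key=lambda image: image[0])


def last(hit: Hit) -> Image:
    """Image with the larger start coordinate."""
    return max(hit, key=lambda image: image[0])


def hit_length(hit: Hit) -> int:
    """Assumption: |h| is the length of the first image of the hit."""
    start, end = first(hit)
    return end - start


def tandem_arrays(pile_hits: List[Hit], m: float = 0.05) -> List[List[int]]:
    """Group the hits of one pile; returns components as lists of hit indices."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(len(pile_hits))}
    for i, j in combinations(range(len(pile_hits)), 2):
        h1, h2 = pile_hits[i], pile_hits[j]
        shorter = min(hit_length(h1), hit_length(h2))
        longer = max(hit_length(h1), hit_length(h2))
        d_s = (first(h2)[0] - first(h1)[0]) / shorter   # start offset of the first images
        d_e = (last(h2)[1] - last(h1)[1]) / shorter     # end offset of the last images
        if shorter / longer > 0.5 and abs(d_s) < m and abs(d_e) < m:
            adj[i].add(j)
            adj[j].add(i)

    components: List[List[int]] = []
    seen: Set[int] = set()
    for i in sorted(adj):
        if i in seen:
            continue
        component: Set[int] = set()
        stack = [i]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(sorted(component))
    return components
```

With repeat units of length 100, the slide's example corresponds to the hits `((0, 300), (100, 400))` and `((0, 200), (200, 400))`, which share start and end coordinates and therefore land in the same component. The sketch returns singleton components too, following the slide's wording literally.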

PILER-TR

Identify and mask Satellites and PSs.

Two pass method: Pass 1: perform banded search for TR candidates. Pass 2: find hits that align TR pairs to each other.

Library Construction

Use MUSCLE (Edgar, 2004a,b). Create multiple alignments of family members found by PILER. Use these to find consensus sequences. This library can be used by BLAST or RepeatMasker to find intact and partial instances.

Satellites and PSs in A. thaliana

De novo identification of repeat families in large genomes

Alkes L. Price, Neil C. Jones, Pavel A. Pevzner

The RepeatScout Algorithm

Improves on the RECON algorithm (Bao and Eddy, 2002). Builds repeat families using high-frequency L-mers as seeds.

Input: DNA sequences $S_1, \ldots, S_n$, each of which contains a similar repeat element and extends past the repeat element on either side.

Output: Substrings $R_1, \ldots, R_n$ that give the repeat element boundaries, and consensus sequence $Q$.

RepeatScout (contd)

Q is defined to be the sequence that maximizes:

$$A(Q, S_1, \ldots, S_n) = \left[\sum_{k} \max\{a(Q, S_k), 0\}\right] - c\,|Q|$$

where $a(Q, S_k)$ can be any reasonable sequence alignment score. The penalty factor $c|Q|$ discourages long Qs. $c$ can be thought of as the minimum number of repeat elements that must align with each given position of Q.
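The objective can be made concrete with a small Python sketch. The choice of a plain Smith-Waterman local alignment for a(Q, S), the scoring values (match +1, mismatch -1, gap -2), and the function names are all assumptions made for illustration; the slides only require "any reasonable" alignment score.

```python
from typing import Callable, List


def local_alignment_score(q: str, s: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(s) + 1)
    best = 0
    for i in range(1, len(q) + 1):
        cur = [0] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (match if q[i - 1] == s[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def objective(q: str, seqs: List[str], c: float,
              score: Callable[[str, str], int] = local_alignment_score) -> float:
    """A(Q, S_1..S_n) = sum_k max(a(Q, S_k), 0) - c * |Q|."""
    return sum(max(score(q, s), 0) for s in seqs) - c * len(q)
```

For instance, `objective("ACGT", ["TTACGTTT", "GGACGTGG", "CCCCCCCC"], c=1)` sums the three per-sequence scores and subtracts the length penalty of 4. With a local alignment score the `max(..., 0)` never triggers, since local scores are non-negative; it matters for scores that can go negative, such as a fit alignment.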

Choice of a(Q, S)

Local Alignment Score.

Fit Alignment Score (Waterman, 1995): Boundaries of Q shared by all segments. Strict constraint on Q.
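To illustrate the "strict constraint on Q", here is a hedged sketch of a fit alignment score, read in the usual sense of a fitting alignment: all of Q must be aligned, against any substring of S. The scoring values and the function name are assumptions for the sketch and are not taken from the RepeatScout paper.

```python
def fit_alignment_score(q: str, s: str, match: int = 1,
                        mismatch: int = -1, gap: int = -2) -> int:
    """Best score of aligning ALL of q against some substring of s."""
    # Row 0: nothing of q consumed yet, so the alignment may start anywhere in s for free.
    prev = [0] * (len(s) + 1)
    for i in range(1, len(q) + 1):
        # Column 0: the first i characters of q are aligned to gaps.
        cur = [i * gap] + [0] * len(s)
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (match if q[i - 1] == s[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The alignment may end anywhere in s, but only after all of q is consumed.
    return max(prev)
```

Unlike a local alignment, this score can be negative when Q fits a sequence poorly: `fit_alignment_score("ACGT", "CCCC")` must pay for three mismatches around the single matching C, which is why the `max{a(Q, S_k), 0}` clamp in the objective is useful with this choice.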

Fit-Preferred Alignment Score

Comparison of Alignment Scores

Optimizing $A(Q, S_1, \ldots, S_n)$

Even dynamic programming for the optimal solution is intractable. The problem would be n-dimensional. Both time and space requirements are exponential in n.

Greedy Heuristic: Suppose L is the high-frequency L-mer and $S_1, \ldots, S_n$ surround its exact matches. Initialize $Q_0$ to L and greedily extend Q.

Optimizing $A(Q, S_1, \ldots, S_n)$

Choose $Q_{t+1} = Q_t \cdot N$, where $N \in \{A, C, G, T\}$ maximizes $A(Q_t \cdot N, S_1, \ldots, S_n)$.

We can re-use alignment scores from the previous iteration while computing alignment scores for the (t+1)th iteration. Terminate after a certain number of iterations gives no improvement. Use this procedure for extending to the right, and then to the left.
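The rightward greedy extension can be sketched as below. This is an illustration under stated assumptions, not RepeatScout's implementation: it uses a plain local alignment score for a(Q, S), recomputes every alignment from scratch instead of re-using scores from the previous iteration, and the `patience` parameter (the number of non-improving iterations tolerated) is a hypothetical name.

```python
from typing import List


def local_alignment_score(q: str, s: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(s) + 1)
    best = 0
    for i in range(1, len(q) + 1):
        cur = [0] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            diag = prev[j - 1] + (match if q[i - 1] == s[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def objective(q: str, seqs: List[str], c: float) -> float:
    """A(Q, S_1..S_n) = sum_k max(a(Q, S_k), 0) - c * |Q|."""
    return sum(max(local_alignment_score(q, s), 0) for s in seqs) - c * len(q)


def extend_right(seed: str, seqs: List[str], c: float, patience: int = 3) -> str:
    """Greedily append the best nucleotide; stop after `patience` steps without gain."""
    q = seed
    best_q, best_value = seed, objective(seed, seqs, c)
    stale = 0
    while stale < patience:
        # Q_{t+1} = Q_t . N for the N in {A, C, G, T} with the highest objective.
        q = max((q + n for n in "ACGT"), key=lambda cand: objective(cand, seqs, c))
        value = objective(q, seqs, c)
        if value > best_value:
            best_q, best_value, stale = q, value, 0
        else:
            stale += 1
    return best_q
```

Extending to the left could re-use the same routine on reversed strings. Tolerating a few non-improving steps lets the extension get past a position where the family members disagree before giving up.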

Optimizing $A(Q, S_1, \ldots, S_n)$

Refine Q after the optimal alignment boundaries are determined.

Prevent redundancy in finding consensus sequences: after identifying Q, locate its occurrences and reduce the counts of L-mers corresponding to those locations. Algorithm terminates when we have no L-mers with effective count of at least m.

More details of parameter settings in the paper.
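The count-reduction loop can be sketched with a simple L-mer table. This is a simplified illustration: the function names, the half-open `(start, end)` occurrence spans, and the plain per-window decrement are assumptions for the sketch, and the paper's exact definition of the effective count may differ.

```python
from collections import Counter
from typing import List, Optional, Tuple


def count_lmers(genome: str, l: int) -> Counter:
    """Count every length-l substring of the genome."""
    return Counter(genome[i:i + l] for i in range(len(genome) - l + 1))


def discount_occurrences(counts: Counter, genome: str,
                         occurrences: List[Tuple[int, int]], l: int) -> None:
    """Lower the count of each L-mer lying inside a located occurrence of Q.

    `occurrences` holds half-open (start, end) spans (assumed layout).
    """
    for start, end in occurrences:
        for i in range(start, end - l + 1):
            counts[genome[i:i + l]] -= 1


def next_seed(counts: Counter, m: int) -> Optional[str]:
    """Return the most frequent remaining L-mer, or None if none reaches m."""
    if not counts:
        return None
    lmer, count = max(counts.items(), key=lambda item: item[1])
    return lmer if count >= m else None
```

A driver loop would call `next_seed`, build a consensus Q from the seed, locate the occurrences of Q, call `discount_occurrences`, and repeat until `next_seed` returns `None`.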

Results

Conclusions

Both PILER and RepeatScout address DNA repeats. PILER focuses more on finding diverse kinds of repeat families and uses MUSCLE to find the consensus sequences. RepeatScout focuses more on finding the consensus sequence given members of a repeat family.

Thank You! Questions?
