(1)Medical School and (2)Department of Computer Science, University of Minnesota, Minneapolis MN;
(3)DOE Plant Research Laboratory, Michigan State University, East Lansing, MI
email: comments@lenti.med.umn.edu
Table of Contents
1.0 Summary
2.0 Searching with DNA or Protein Motifs
2.1 Simple searches
2.2 More complex searches
2.3 Future additions
3.0 Searching with longer DNA Sequences
3.1 How to search
3.2 Future additions
4.0 Searching with Key Words
5.0 Acknowledgments
6.0 References
1.0 Summary
Several programs have been developed at the University of Minnesota's Plant Molecular Informatics Center to aid in the molecular investigation of Arabidopsis. Expressed Sequence Tags (ESTs) provided by Dr. Tom Newman at
Michigan State University (Newman et al., 1994) are pre-processed by automatically searching for and removing vector, as well as trimming off sequence that has a high concentration of ambiguous bases (Bieganski, 1995;
Shoop et al., 1995). These sequences are then input into databases, which in turn have been exploited to develop software for the plant community. The software has been designed to allow you to search quickly through
enormous amounts of similarity data and DNA sequences to arrive at a set of sequences that may interest you. Regardless of your research interests and approaches, we likely have a search mechanism for you! These tools
include:
Motif Explorer--a locally developed tool for searching Arabidopsis ESTs or PIR sequences for the presence of your short DNA or protein motif;
EST databases for plants--for searching Arabidopsis ESTs or other plant cDNA for similarity to the DNA sequence of your choice;
a WAIS index search--for searching exhaustive reports detailing the similarity of Arabidopsis ESTs to sequences in GenBank, GenInfo and PIR.
The purpose of this brief electronic tutorial is to highlight proper search strategies for all three tools, and to provide detailed examples for searching with Motif Explorer and utilizing the plant EST database. These tools are
available via our World Wide Web server [http://lenti.med.umn.edu]. Since this tutorial is designed to be used on-line, you may want to open multiple windows in your browser. This way you can have this article in one window
and one of our tools in a second window, allowing for experimentation as you read. As you explore, keep in mind that most of the Arabidopsis sequences have been submitted to dbEST (Boguski et al., 1993) at NCBI, and all of
the Arabidopsis clones, from which the EST sequences are derived, are made available at the ABRC Stock Center at Ohio State University.
2.0 Searching with DNA or Protein Motifs
2.1 Simple searches
Motif Explorer is available on our server under the Arabidopsis cDNA project and you can operate it by completing a simple Web form (Figure 1). Three steps are necessary to complete the form. First, click on the "database"
button to choose the Michigan State University Arabidopsis database (a DNA database), or a protein database consisting of PIR sequences. Similarly, click on the "pattern alphabet" button to tell the program to search using
nucleotides or amino acids. Third, you must enter the sequence you wish to locate--make sure this corresponds to your alphabet selection. There are other parameters you may set (Section 2.2); we recommend that for your first
few searches, you keep the default settings.
If you want to find ELVIS in PIR (alas, we could not find him in Arabidopsis), fill out the form as shown in Figure 1 and press the "search" button to receive a report similar to the one shown in Figure 2. At the top of this
report, you are reminded of the pattern, or motif, for which you searched. A total of the matches found is the next item in the report, followed by the unique matching fragment summary. Note that the search was actually
conducted on ELVIS?, not ELVIS; the question mark serves as a wild card, so any amino acid (or base, for DNA searches) is accepted in that position. Since there are 20 amino acids, there could be as many as 20 unique fragments
that are found, and any of these could be repeated any number of times. After this summary is a list of all the sequences in which the query was found. The query motif is shown in pointy brackets and a short piece of surrounding
sequence is also displayed, as it may provide additional insight for your search. A list of these sequences, along with links to the PIR or Arabidopsis database, is provided at the end of this report. By clicking on these links you
will find out detailed information about the sequences that contain your search pattern. If you search using the PIR database, the resulting links go directly to the Entrez document server, and if you search using the Arabidopsis
database, the links go to an html document created at the University of Minnesota that contains detailed information regarding the similarity of that sequence to sequences from other public databases. These are the same reports
that have been WAIS indexed for key word searching and they are discussed below (Section 4.0).
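The report layout described above (a total, a unique matching fragment summary, and each hit shown in pointy brackets with a little flanking sequence) can be imitated with a few lines of Python. This is only a rough sketch using Python's re module, not the suffix-tree method that Motif Explorer itself uses; the function name, the flank width, and the toy sequences are all made up for illustration.

```python
import re
from collections import Counter

def motif_report(pattern, sequences, flank=5):
    """Search each sequence for `pattern`, where '?' matches any one residue."""
    # Translate the '?' wild card into the regex '.' wild card.
    regex = re.compile("".join("." if ch == "?" else re.escape(ch) for ch in pattern))
    fragments = Counter()   # unique matching fragments and how often each occurs
    matches = []            # (sequence name, hit shown in pointy brackets with context)
    for name, seq in sequences.items():
        for m in regex.finditer(seq):
            fragments[m.group()] += 1
            left = seq[max(0, m.start() - flank):m.start()]
            right = seq[m.end():m.end() + flank]
            matches.append((name, f"{left}<{m.group()}>{right}"))
    return fragments, matches

# Hypothetical toy sequences, not real PIR entries.
toy_db = {
    "toy1": "MKTAELVISAGGHLELVISK",
    "toy2": "GGELVITPPQ",
    "toy3": "MSTNPKPQRK",
}

fragments, matches = motif_report("ELVIS?", toy_db)
print("Total matches:", sum(fragments.values()))
for fragment, count in sorted(fragments.items()):
    print(fragment, count)
for name, context in matches:
    print(name, context)
```

Here the toy query finds two distinct fragments (ELVISA and ELVISK), both in the first toy sequence, which mirrors how a single wild card can produce several unique fragments.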
CAT-[CG]-x-GTG-{CG}-x(2,6)-GTA
The results from using this search on the Arabidopsis database include six fragments of varying length. The hyphens are not essential; they merely aid in grouping.
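For readers more familiar with ordinary regular expressions, the pattern above can be read element by element: square brackets list the acceptable symbols, curly brackets list the symbols that are not acceptable, x accepts anything, and a parenthesized number or range gives a repeat count. The following Python sketch translates those elements into Python's re syntax under that reading of the PROSITE conventions; the function name and the toy DNA strings are assumptions for illustration, and this is not how Motif Explorer is implemented.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python regular-expression syntax."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "-":                 # hyphens only aid grouping, so skip them
            i += 1
        elif ch == "x":               # x accepts any symbol
            out.append(".")
            i += 1
        elif ch == "[":               # [CG]: any one of the listed symbols
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch == "{":               # {CG}: any symbol except those listed
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "(":               # (2,6): repeat count for the previous element
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        else:                         # a literal base or amino acid
            out.append(ch)
            i += 1
    return "".join(out)

regex = prosite_to_regex("CAT-[CG]-x-GTG-{CG}-x(2,6)-GTA")
print(regex)

# Hypothetical toy DNA strings: the second has a C where {CG} forbids one.
hit = re.search(regex, "TTCATCAGTGATTTGTACC")
miss = re.search(regex, "TTCATCAGTGCTTTGTACC")
print(hit.group() if hit else None, miss)
```

Because hyphens are simply skipped, the same function gives an identical result for the pattern written without them.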
PROSITE conventions also allow searching for patterns at the beginning or end of a sequence using left or right pointy brackets ("<" or ">"), respectively. For example, <A-x-[ST](2)-x(0,1)-V will list only matches at the
N-terminus of the sequence, and G[ACG]GATC> will list matches occurring at the 3' end of the DNA sequence. Motif Explorer first determines whether a motif is contained within a sequence and only then
locates the position of the subsequence. (This strategy enables increased speed in searching.) As a result, the "Unique matching fragment summary" lists all unique motifs found anywhere in the database sequences, while the
"Matches" section only lists those sequences that have hits to the appropriate sequence location. For example, the query
G[ACG]GATC>
locates over 3,700 fragments matching the patterns GAGATC, GCGATC, and GGGATC, but the report indicates that only three examples from the latter two patterns are found at the 3' end of the sequence.
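The difference between the two report sections can be illustrated with a small Python sketch: every occurrence of the core motif is counted for the fragment summary, but only occurrences that finish exactly at the end of a sequence are kept as matches. The function name and the toy EST strings are hypothetical, and the regular-expression scan stands in for Motif Explorer's own, faster strategy.

```python
import re
from collections import Counter

def anchored_3prime_search(core_regex, sequences):
    """Count the core motif everywhere, but keep as matches only hits at the 3' end."""
    regex = re.compile(core_regex)
    fragments = Counter()   # corresponds to the "Unique matching fragment summary"
    end_matches = []        # corresponds to the "Matches" section for a ">" query
    for name, seq in sequences.items():
        for m in regex.finditer(seq):
            fragments[m.group()] += 1
            if m.end() == len(seq):      # the hit finishes at the last base
                end_matches.append((name, m.group()))
    return fragments, end_matches

# Hypothetical toy sequences, not real ESTs.
toy_ests = {
    "estA": "TTGAGATCAAGCGATC",
    "estB": "GGGATCTTTT",
    "estC": "ACGTGAGATCAA",
}

# The query G[ACG]GATC> is split here into its core motif and the 3' anchor test.
fragments, end_matches = anchored_3prime_search("G[ACG]GATC", toy_ests)
print(dict(fragments))
print(end_matches)
```

In this toy run four fragments are counted across the three sequences, yet only one of them sits at a 3' end, the same kind of gap between summary and matches as in the example above.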
There is a slight oddity you may notice when you run a search using the left pointy bracket, "<". This symbol is used by Motif Explorer to designate the beginning of a sequence, and it also has meaning in HTML (the
language we use to make Motif Explorer available to you on the server). Due to this double use of "<", the "Query expression" line on the report will show up blank, but this is not cause for concern, as the search and its results
are not affected.
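To see why a leading "<" can make text vanish from a web page, consider what a browser does with an unescaped query: the characters after "<" look like the start of an HTML tag, so they are not displayed. The Python sketch below shows the usual remedy, escaping the text before placing it in a page, using the standard library's html.escape. Whether our server's report is produced this way is not claimed here; the line label is taken from the report and the rest is an illustrative assumption.

```python
import html

query = "<A-x-[ST](2)-x(0,1)-V"

# Written into a page as-is, the browser would read "<A-x-..." as a tag.
raw_line = f"Query expression: {query}"

# Escaping turns "<" into the entity "&lt;", which the browser displays as "<".
safe_line = f"Query expression: {html.escape(query)}"

print(raw_line)
print(safe_line)
```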
Additional searching power is gained using regular expressions, a pattern searching technique that is commonly used in computer science. The ? (question mark) character, the | (vertical bar, or "pipe") operator, and parentheses
are used along with the letters of the DNA (or protein) alphabet to make these expressions. The ? (question mark) character allows for any base (or amino acid) to be accepted, just as with the "x" in PROSITE conventions, and
the | (vertical bar, or "pipe") operator acts as an "or" statement. For example, if you were looking for EcoRI or AlwNI restriction enzyme digestion sites, you could enter the following search:
(GAATTC)|(CAG???CTG)
to find matches to GAATTC, or any string of bases that starts with CAG, ends with CTG and has any 3 bases in between. Parentheses can also be used to nest the expressions. By searching on AC((AG)|(CC))T, you will match
the sequences ACAGT and ACCCT.
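If you would like to try these expressions away from the server, they map almost directly onto Python's re module: the | operator and parentheses mean the same thing, and only the ? wild card needs to be rewritten as a dot. The sketch below makes that substitution; the helper name and the toy DNA string are assumptions for illustration.

```python
import re

def to_python_regex(expr):
    """Rewrite the '?' wild card as '.', leaving '|' and parentheses unchanged."""
    return expr.replace("?", ".")

# EcoRI (GAATTC) or AlwNI (CAG, any three bases, CTG).
site_re = re.compile(to_python_regex("(GAATTC)|(CAG???CTG)"))

# Hypothetical toy sequence containing one site of each kind.
toy_dna = "TTGAATTCAACAGTTTCTGAA"
hits = [m.group() for m in site_re.finditer(toy_dna)]
print(hits)

# Nested expression: matches ACAGT and ACCCT, but not ACACT.
nested = to_python_regex("AC((AG)|(CC))T")
for candidate in ("ACAGT", "ACCCT", "ACACT"):
    print(candidate, bool(re.fullmatch(nested, candidate)))
```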
https://www.arabidopsis.org/weedsworld/Vol2iii/mX_blastn_tutorial.html 1/2
4/15/2018 Does ELVIS (GA[AG]CT[ACGT]GT[ACGT]AT[ACT]TC[ACGT]TA[AG]) Live in Minnesota ('s Arabidopsis Databank)?
The more complex queries may require that you change some of the other parameters on the form. Many searches will complete within the default "very fast" search time. Sometimes your search will take longer, and the 20-second
time limit is appropriate. If you need to conduct a search that has no time limit, the results you obtain are unlikely to be useful to your work. If you "uncheck", or turn off, the "Report unique matching
fragments" box, then the report will not show the "Unique matching fragment summary" section, which is missing in Figure 3 (compare with Figure 2, which has this section). This section is most useful when your search may
find more than one distinct pattern. By "unchecking" the "Report matching sequences" box, you will not get the list of matches (Figure 4; compare with Figure 2). While this step drastically limits the amount of information you will
receive in the report, it does speed up the query time, and is useful when you are testing out the prevalence of your search pattern.
3.0 Searching with longer DNA Sequences
3.1 How to search
If you have sequenced a gene from Arabidopsis or another plant, you will be interested in the plant EST databases that we have created by compiling Arabidopsis, maize, rice, and loblolly pine sequences. Using the form provided
(a completed example is shown in Figure 5), you may analyze your DNA sequence for similarity (using the BLASTN algorithm; Altschul et al., 1990) against a single genome or all of the plant species we have available. The
advantage of this tool is that you can compare your plant sequence against a large number of other plant cDNA without having to sift through all the matches in other public databases, which consist mainly of animal cell,
bacterial, and other non-plant DNA. In addition, the BLASTN analysis is performed essentially instantaneously, without your needing to format a special email. Simply choose your database from the list provided, enter the
sequence you are studying (if you already have your sequence in the computer, be sure to copy and paste it in the form to avoid typing errors!), and press the "submit your query" button. Your sequence will then be blasted against
all the cDNA we have available in the database you chose.
The reports that you see when the search is completed are easy to interpret. They start and end with a listing of the database you used for your search and the sequence you analyzed. In between come, first, a list of the sequences
producing high-scoring alignments and, second, the actual alignments with score, p-value and % identity information. Searches investigating the similarity of the Arabidopsis EST 169H15T7 (a probable catalase, shown in Figure
5) to the maize (corn) database (maize summary) and the entire plant database (plant summary) were conducted for your perusal. Note that this sequence has a high similarity to one maize EST, which is more easily seen in the corn-only
database, and it also appears in the plants database, along with many good matches to both Arabidopsis and rice ESTs.
There are only a few points to remember when using this program. While the minimum sequence length is 12 nucleotides, this program is intended for use with sequences that are about 50 bases or longer. Five bases are
acceptable in your query sequence: G, A, T, C, or N, if there is an ambiguity. Also, keep in mind that your BLAST search must be conducted on a "fixed" sequence. That is, if you want to look at two sequences that are slightly different,
you must run the BLASTN program twice, using a different sequence each time. If you are interested in learning more about the subject sequence (the sequence that your query sequence matched), you can enter its
clone id into the WAIS index search that we have established (described below, Section 4.0).
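Before pasting a sequence into the form, you may find it handy to check it against the points above: only G, A, T, C and N are accepted, at least 12 nucleotides are required, and about 50 or more are recommended. The Python sketch below performs those checks. The function name and messages are hypothetical, and stripping whitespace and upper-casing the input are our assumptions for the example, not documented behavior of the server.

```python
ALLOWED_BASES = set("GATCN")   # the five symbols accepted in a query
MIN_LENGTH = 12                # minimum sequence length
RECOMMENDED_LENGTH = 50        # the program is intended for about 50 bases or longer

def check_query(raw):
    """Return the cleaned sequence and a list of problems found in it."""
    # Assumption for this sketch: ignore whitespace and letter case.
    seq = "".join(raw.split()).upper()
    problems = []
    unexpected = sorted(set(seq) - ALLOWED_BASES)
    if unexpected:
        problems.append("unsupported characters: " + ", ".join(unexpected))
    if len(seq) < MIN_LENGTH:
        problems.append(f"too short: {len(seq)} bases (minimum {MIN_LENGTH})")
    elif len(seq) < RECOMMENDED_LENGTH:
        problems.append(f"note: {len(seq)} bases is below the recommended {RECOMMENDED_LENGTH}")
    return seq, problems

for raw in ("gatc gatc\nGATN", "GATCU", "GATC" * 15):
    seq, problems = check_query(raw)
    print(len(seq), problems)
```

Since each search uses one fixed sequence, two slightly different sequences would each be checked and submitted separately.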
Regardless of your search preference, we wish you happy hunting! As you peruse our server, please be sure to complete the registration form, so you will be notified of changes, updates and new additions. If you have any
questions or comments about this tutorial or any of the tools we have available on the web, please email us at comments@lenti.med.umn.edu and one of us will get back to you.
5.0 Acknowledgments
The University of Minnesota Plant Data Acquisition, Analysis and Distribution Project is funded under NSF Grant BIR 940-2380, and the Michigan State University DOE Plant Research Laboratory Arabidopsis cDNA
Sequencing Project is funded under NSF Grant BIR 931-3751. In addition, we are supported by major resources from the following: The University of Minnesota Medical School, with special thanks to Dean Frank Cerra;
Computing and Informations Services, with special thanks to Professor and Vice President Don Riley; University Networking Services, with special thanks to Director Larry Dunn; Sun Microsystems, with special thanks to
Sandra Swenson; IBM, with special thanks to Norm Troullie and Pat Carey; and Cray Research, Inc., with special thanks to John Carpenter and Bill King.
We would also like to give our thanks to Mary Anderson of the Nottingham Arabidopsis Stock Centre, and Carolyn Tolstoshev, Mark Boguski and Jane Weisemann of NCBI's dbEST; their encouragement and assistance in this
work has been very important to us.
6.0 References
S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. "Basic Local Alignment Search Tool", Journal of Molecular Biology, 215:403-410.
A. Bairoch. 1995. "PROSITE: A Dictionary of Protein Sites and Patterns: User Manual", Release 13.0, available at [http://expasy.hcuge.ch/sprot/prosite.html].
P. Bieganski. 1995. "Genetic Sequence Data Retrieval and Manipulation based on Generalized Suffix Trees", Ph.D. Thesis, University of Minnesota, Minneapolis, MN.
P. Bieganski, J. Riedl, J.V. Carlis and E.F. Retzel. 1996. "High-Performance Interactive Exploration of Amino Acid Sequence Motifs", Pacific Symposium on Biocomputing, Hawaii. Accepted.
P. Bieganski, J. Riedl, J.V. Carlis and E.F. Retzel. 1994. "Generalized Suffix Trees for Biological Sequence Data: Applications and Implementation" In: Proceedings of the IEEE 27th Hawaii International Conference on System
Sciences. Oahu, Hawaii. L. Shriver and L. Hunter, (Eds.). IEEE Computer Society Press. V:35-44.
M.S. Boguski, T.M.J. Lowe, and C.M. Tolstoshev. 1993. "dbEST - database for expressed sequence tags." Nature Genetics, 4:332-333.
T. Newman, F. de Bruijn, P. Green, K. Keegstra, H. Kende, L. McIntosh, J. Ohlrogge, N. Raikhel, S. Somerville, M. Thomashow, E.F. Retzel and C. Somerville. 1994. "Genes Galore: A Summary of Methods for Accessing
Results from Large-Scale Partial Sequencing of Anonymous Arabidopsis cDNA Clones." Plant Physiology. 106:1241-1255.
E. Shoop, E. Chi, J.V. Carlis, P. Bieganski, J. Riedl, N. Dalton, T. Newman and E.F. Retzel. 1995. "Implementation and Testing of an Automated EST Processing and Similarity Analysis System." In: Proceedings of the IEEE 28th
Annual International Conference on System Sciences. Maui, Hawaii. L. Shriver and L. Hunter, (Eds.). IEEE Computer Society Press. 5:52-61.
K.L. Swope, T.C. Newman, E. Shoop, P. Bieganski, E. Chi, O. Holt, J. Carlis, J. Riedl and E.F. Retzel. 1995. "Everything you wanted to know about the University of Minnesota's analysis of Arabidopsis ESTs but were afraid to
ask." Weeds World: The International Electronic Arabidopsis Newsletter. 2(ii):21-26. Available at [http://nasc.life.nott.ac.uk:8300/]