Genes are pieces of DNA, and most genes contain the information for making a specific protein.
Bioinformatics: The field of informatics that deals with protein and nucleic acid sequence data. Bioinformatics includes the development of methods to organize and search databases, to analyze DNA and protein sequence data, and to predict protein structure.
Phenotype - The realized expression of the genotype; the observable expression of a trait, which results from the biological activity of proteins or RNA molecules transcribed from the DNA.
Genotype - The genetic constitution of an organism, i.e. the particular alleles it carries; the phenotype is its observable expression.
Gene expression - The process by which proteins are made from the instructions encoded in DNA.
Domain - A well-defined, discrete portion of a protein, assumed to fold independently of the rest of the protein and possessing its own function. The combination of domains in a single protein determines its overall function.
Promoter - A portion of DNA where RNA polymerase attaches to begin transcription.
Nucleotide - A single molecule composed of a phosphate, a five-carbon sugar, and a nitrogenous base; nucleotides make up the sequences of DNA.
Base pair (bp): Two nitrogenous bases (adenine and thymine, or guanine and cytosine) held together by weak bonds. Two strands of DNA are held together in the shape of a double helix by the bonds between base pairs.
EST - Expressed sequence tag. A short strand of DNA that is a part of a cDNA molecule and can act as an identifier of a gene. Used in locating and mapping genes.
Motif - A short conserved region in a protein sequence. Motifs are frequently highly conserved parts of domains.
Polypeptide chain - A chain of amino acids (residues) joined by peptide bonds. Chains range from a few dozen to thousands of residues. A protein is made up of one or several polypeptide chains.
Ligand - A small molecule noncovalently bonded to a larger macromolecule.
Sequence tagged site (STS) - A short (200 to 500 base pairs) DNA sequence that has a single occurrence in the human genome and whose location and base sequence are known. Detectable by polymerase chain reaction, STSs are useful for localizing and orienting the mapping and sequence data reported from many different laboratories and serve as landmarks on the developing physical map of the human genome.
ORF - Open Reading Frame: A long DNA sequence that is uninterrupted by a stop codon and encodes part or all of a protein. A potential protein-coding nucleic acid sequence.
Exon: The region of a gene that contains the code for producing the gene's protein. Each exon codes for a specific portion of the complete protein. In some species (including humans), a gene's exons are separated by long regions of DNA (called introns or sometimes "junk DNA") that have no apparent function.
Intron: A noncoding sequence of DNA that is initially copied into RNA but is cut out of the final RNA transcript.
Splicing - The cutting out of introns and joining of exons to form a complete RNA strand with no introns.
Locus - The place on a chromosome where a specific gene is located, a kind of address for the gene. The plural is "loci," not "locuses."
mRNA - Template for protein synthesis. Each set of three bases, called a codon, specifies a certain amino acid in the sequence of amino acids that make up the protein. The sequence of a strand of mRNA is based on the sequence of a complementary strand of DNA.
PCR - Polymerase Chain Reaction: A fast, inexpensive technique for making an unlimited number of copies of any piece of DNA. Sometimes called "molecular photocopying," PCR has had an immense impact on biology and medicine, especially genetic research.
Promoter - The part of a gene that contains the information to turn the gene on or off. The process of transcription is initiated at the promoter.
Protease - A protein that digests other proteins.
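The ORF definition above can be turned into a small scan. The following is only a sketch under stated assumptions: it looks at the forward strand only, requires an ATG start codon, and uses a hypothetical `min_len` cutoff of 100 bases (the function name `find_orfs` and its parameters are made up for illustration, not taken from any tool).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_len=100):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    `end` is exclusive and includes the stop codon. `min_len` is an
    illustrative cutoff in bases, not a fixed rule.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3                    # close it at the first stop codon
                if end - start >= min_len:
                    orfs.append((start, end, frame))
                start = None
    return orfs
```

For example, `find_orfs("CCATGGCTGCTGCTGCTTAAGG", min_len=9)` reports one ORF starting at the ATG in frame 2. A real gene finder would also scan the reverse complement and weigh other signals.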
Single Nucleotide Polymorphisms (SNPs) - Common, but minute, variations that occur in human DNA at a frequency of one every 1,000 bases. These variations can be used to track inheritance in families. Translocation - Breakage and removal of a large segment of DNA from one chromosome, followed by the segment's attachment to a different chromosome. Vector - An agent, such as a virus or a small piece of DNA called a plasmid, that carries a modified or foreign gene. When used in gene therapy, a vector delivers the desired gene to a target cell. Mapping - The process of deducing schematic representations of DNA. Three types of DNA maps can be constructed: physical maps, genetic maps, and cytogenetic maps, with the key distinguishing feature among these three types being the landmarks on which they are based. haploid - The number of chromosomes in a sperm or egg cell, half the diploid number. Marker - Also known as a genetic marker, a segment of DNA with an identifiable physical location on a chromosome whose inheritance can be followed. A marker can be a gene, or it can be some section of DNA with no known function. Because DNA segments that lie near each other on a chromosome tend to be inherited together, markers are often used as indirect ways of tracking the inheritance pattern of genes that have not yet been identified, but whose approximate locations are known. Residue: When two or more amino acids combine to form a peptide, the elements of water are removed, and what remains of each amino acid is called an amino acid residue. C-terminus: The residue that has a free carboxyl group, or at least does not acylate another amino acid residue, (it may, for example, acylate ammonia to give -NH-CHRCO-NH2), is called C-terminal. N-terminus: The residue in a peptide that has an amino group that is free, or at least not acylated by another amino acid residue (it may, for example, be acylated or formylated), is called N-terminal; it is the Nterminus. 
Also called the amino terminus.
Peptides: Amides derived from two or more amino carboxylic acid molecules (the same or different) by formation of a covalent bond from the carbonyl carbon of one to the nitrogen atom of another, with formal loss of water. The term is usually applied to structures formed from α-amino acids, but it includes any amino carboxylic acid. Shorter than proteins.
Transcription: RNA polymerase produces a transcription unit that extends from the promoter to the termination sequences. The gene is defined in reference to the start site - those sequences before the start site are called the upstream sequences, those after the start site are called downstream sequences. The immediate product is the primary transcript. Several protein transcription factors bind to promoter sites, usually on the 5' side of the gene to be transcribed. The RNA polymerase moves along the template strand in the 3' -> 5' direction, so synthesis of the RNA proceeds in the 5' -> 3' direction.
Upstream: The region extending in a 5' direction from a gene.
Downstream: The region extending in a 3' direction from a gene.
Protein Fold: The core 3D structure of a domain, or the overall folding pattern of a 3D protein or RNA structure.
Conserved Region (see Motif): A sequence of amino acids in a polypeptide, or of nucleotides in DNA or RNA, that is similar across multiple species. A known set of conserved sequences is represented by a consensus sequence. Amino acid motifs are often composed of conserved sequences.
Markov model: A statistical model in which the probability of each letter depends on its predecessors.
FASTA: A database search tool used to compare a nucleotide or peptide sequence to a sequence database. The program is based on a rapid, heuristic sequence comparison algorithm that searches for optimal local alignments.
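Motif - A short, recurring pattern in a protein. Structural motifs are built from secondary-structure elements such as alpha helices and beta sheets. (A discrete portion of a protein assumed to fold independently of the rest of the protein is a domain.)
Primary Structure: The amino acid sequence of a polypeptide chain. The most basic of the four levels of protein structure.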
Conformation (Shape): Proteins function through their conformation. A protein's conformation is usually described in terms of levels of structure. Alpha Helix: certain types of bonding between groups located on the same polypeptide chain cause the backbone to twist into a helix. Beta Sheet: Beta sheets are formed when a polypeptide chain bonds with another chain that is running in the opposite direction and may also be formed between two sections of a single polypeptide chain that is arranged such that adjacent regions are in reverse orientation. Tertiary structure: describes the organization in three dimensions of all of the atoms in the polypeptide. Globular proteins have more compact, often irregular structures. This class of proteins includes most enzymes and most of the proteins involved in gene expression and regulation. Tryptophan. Fibrous proteins have elongated structures, with the polypeptide chains arranged in long strands. This class of proteins serves as major structural components of cells and therefore their role tends to be static in providing a structural framework. Allosteric proteins can change their shape and function depending on the environmental conditions in which they are found. Determine Protein Structure by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy. Folding motifs are independent folding units, or particular structures, that recur in many molecules. Domains are the building blocks of a protein and are considered elementary units of molecular function. Families are groups of proteins that demonstrate sequence homology, or have similar sequences. Superfamilies consist of proteins that have similar folding motifs, but do not exhibit sequence similarity. An Expressed Sequence Tag (EST) is a tiny portion of an entire gene that can be used to help identify unknown genes and to map their position within a genome. ESTs are
DNA sequences (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs.
BLAST: Works by comparing a user's unknown sequence against the database of all known sequences to determine likely matches.
TATA box: A promoter sequence found 25-30 nucleotides upstream of the beginning of the gene (the beginning itself is referred to as the initiator sequence).
ORF: Stretches of DNA, usually greater than 100 bases, that are not interrupted by a stop codon such as TAA, TAG or TGA. Other signals used to recognize genes are start codons such as ATG; specific sequences found at splice junctions (locations in the DNA sequence where the noncoding areas are removed from the RNA to form a continuous gene transcript for translation into a protein); and gene regulatory sequences.
Local Alignment: The alignment of some portion of two nucleic acid or protein sequences.
Alignment: The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
Bit Score: An adjusted score assigned to a sequence that accounts for the type of scoring system used. It is calculated from the raw score by normalizing with the statistical variables that define a given scoring system. Therefore, bit scores from different alignments, even those employing different scoring matrices, can be compared.
Global alignment: The alignment of two nucleic acid or protein sequences over their entire length.
Conservation, conserved - Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physicochemical properties of the original residue.
Searching is done to find relatedness between the query and the entries in the databases.
For nucleic acids and proteins, the relatedness is defined by "homology". A "query" sequence is used to search against each entry ("subject") in a database. Two sequences are said to be homologous when they possess sequence identity above a threshold limit. Threshold limits can be defined by length, percentage identity, E-value, P-value, bit score, r.m.s.d. etc., or a combination of these, depending on the objective of the search.
Basic Elements of Searching Biologic DBs:
* Specificity/selectivity versus sensitivity
* Scoring scheme, gap penalties
* Distance/substitution matrices (PAM, BLOSUM series)
* Search parameters (E value, bit score)
* Handling data quality issues (filtering, clustering)
* Type of algorithm (Smith-Waterman, Needleman-Wunsch)
Nucleotide Substitutions:
Transitions vs transversions
* Transition: A->G or C->T, or vice versa
* Transversion: A->T or A->C or G->T or G->C, or vice versa
Synonymous vs non-synonymous substitutions
* Synonymous: CCC -> CCG, both code for proline
* Non-synonymous: UGC -> UGG, cysteine -> tryptophan
Non-degenerate vs degenerate sites
* Non-degenerate: all possible changes are non-synonymous. Ex: AUG (Met) and UGG (Trp)
* Degenerate: some or all substitutions are synonymous
* 2-fold degenerate: 1 of 3 possible changes is synonymous. Ex: Phe (UUU, UUC)
* 3-fold degenerate: 2 of 3 possible changes are synonymous. Ex: Ile (AUU, AUC, AUA) (*the only such amino acid)
* 4-fold degenerate: all 3 possible changes are synonymous
Secondary Structure: The folded, coiled, or twisted shape of a polypeptide that results from hydrogen bonding between parts of the molecule. Two types of structure: alpha helices and beta pleated sheets.
Identifying a protein's shape, or structure, is key to understanding its biological function and role. Amino acids can be arranged in any order to form a polypeptide that can be thousands of amino acids long. These chains can then loop about each other, or fold, in a variety of ways. The shape a protein assumes is determined by the specific linear sequence of amino acids from N-terminus (start) -> C-terminus (end). Different amino acid sequences fold into different 3D shapes. It is theorized that proteins that share a similar sequence generally share the same basic structure. However, two sequences may be different but function the same.
Positively (+) charged AA (basic amino acids): Arg, His, Lys
Negatively (-) charged AA (acidic amino acids): Asp, Glu
Hydrophilic AA: Uncharged, but with an uneven charge distribution, so these AA can form hydrogen bonds with water. Found on the outer surface of folded proteins, in contact with the watery environment of the cell: Asn, Cys, Gln, Gly...
Hydrophobic AA: Uniform charge distribution; do not form hydrogen bonds with water. Found in the interior of folded proteins: Trp, Met, Ile...
Ex (4-fold degenerate): Gly (GGU, GGC, GGA, GGG)
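The substitution categories in these notes (transition vs transversion, synonymous vs non-synonymous, degree of degeneracy) can be checked mechanically. The sketch below assumes the standard genetic code written as RNA codons; the function names are hypothetical helpers chosen for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons enumerated in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U", "T"}

def substitution_type(a, b):
    """Classify a base change as a transition or a transversion."""
    pair = {a.upper(), b.upper()}
    if len(pair) != 2 or not pair <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different nucleotide letters")
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def is_synonymous(codon_a, codon_b):
    """True when both codons encode the same amino acid."""
    return CODON_TABLE[codon_a] == CODON_TABLE[codon_b]

def synonymous_changes(codon, position):
    """Count how many of the 3 possible base changes at `position` are synonymous."""
    count = 0
    for base in BASES:
        if base != codon[position]:
            mutant = codon[:position] + base + codon[position + 1:]
            count += is_synonymous(codon, mutant)
    return count
```

With this table, `synonymous_changes("GGU", 2)` is 3 (a 4-fold degenerate site), `synonymous_changes("AUU", 2)` is 2 (3-fold), `synonymous_changes("UUU", 2)` is 1 (2-fold), and every position of AUG or UGG gives 0 (non-degenerate).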
BLAST - A sequence comparison algorithm optimized for speed used to search sequence databases for optimal local alignments to a query. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S." The "T" parameter dictates the speed and sensitivity of the search. Core Elements of Alignment Algorithm * Defining the problem (Global, semiglobal, local alignment) * Scoring scheme (Gap penalties) * Distance Matrix (PAM, BLOSUM series) * Scoring/Target function (How scores are calculated) _________________________________ A PAM-x substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced x amount of evolutionary divergence. GAP: A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acid is also penalized in the scoring of an alignment. BLOSUM: Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid overweighting closely related family members. 
Blosum: 30, 62, 80
* Built from the BLOCKS database
* From the most conserved regions of aligned sequences
* 2000 blocks from 500 families
* Blosum 62 is the most popular. Here, 62 means that sequences more than 62% identical were clustered and counted as a single sequence when the matrix was built
* High Blosum - Built from closely related sequences
* Low Blosum - Built from distant sequences Filtering (low complexity) A technique for masking off or removing segments of the query sequence that are repeated or have low compositional complexity, in order to improve the sensitivity of sequence similarity searches performed with that sequence. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is applied to the query sequence (or its translation products) only, not to database sequences PAM: Percent Accepted Mutation. A unit that quantifies the amount of evolutionary change in a protein sequence. 1.0 PAM unit is the amount of evolution that will change, on average, 1% of amino acids in a protein sequence. Raw Score: The raw score "S" of the alignment is usually calculated by summing the scores for each letter-to-letter and letterto-null position in the alignment. Scores for each position of an alignment are derived from a substitution matrix, such as the BLOSUM and PAM matrices. Search Parameters E-Value or Expect Value: Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score. The E value is the statistical significance threshold for reporting matches against database sequences. The default value is 10. If the statistical significance ascribed to a match is greater than the Expect threshold, the match will not be reported. Lower Expect thresholds are more stringent, leading to fewer chance matches being reported. Increasing the threshold shows less stringent matches. Bit score represents sequence comparisons that is independent of the size of the database. 
Raw scores are normalized to bit scores by incorporating the statistical parameters of the scoring scheme used. The size of the search space (query length and database size) enters only when a bit score is converted into an E-value.
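The relationship between raw score, bit score and E-value can be made concrete with the usual Karlin-Altschul formulas, S' = (lambda*S - ln K) / ln 2 and E = m * n * 2^(-S'). This is a sketch only: lambda and K depend on the scoring matrix and gap penalties, and the values you pass in are your own assumptions, not constants taken from these notes.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S using the scoring system's lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least `bits`.

    query_len * db_len approximates the size of the search space.
    """
    return query_len * db_len * 2.0 ** (-bits)
```

The bit score does not change with database size, but `e_value` does: doubling `db_len` doubles the number of matches expected by chance at the same bit score.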
Differences between PAM & BLOSUM:
PAM:
+ Built from global alignments
+ Built from a small amount of data
+ Counting based on minimum replacement
+ Better performance with global alignments
+ Higher PAM series means more divergent
BLOSUM:
+ Built from local alignments
+ Built from a vast amount of data
+ Counting based on groups of related sequences counted as one
+ Better for finding local alignments
+ Lower BLOSUM series means more divergent
Global - When two sequences are of approximately equal length. Here, the goal is to obtain the maximum score by completely aligning them.
Semi-global - When one sequence matches towards one end of the other. Ex: searches for 5' or 3' regulatory sequences.
Local - When one sequence is a substring of the other, or the goal is to get the maximum local score. Ex: protein motif searches in a database.
Concept: build the optimal alignment from previously built optimal alignments of smaller subsequences. When extending two aligned prefixes (for example ACCGT and AGCGT) by the next residues xi and yi, there are three choices:
a) align xi with yi (a perfect match scores 1)
b) align xi to a gap
c) align yi to a gap
For each cell, take the maximum of:
a) the value in the cell diagonally above plus a score from a scoring matrix
b) the value of the cell to the left minus a gap penalty
c) the value of the cell above minus the gap penalty
Global and Local Alignment
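The cell-filling rule above is the core of global alignment by dynamic programming (Needleman-Wunsch). A minimal sketch follows, assuming a simple match/mismatch score in place of a full substitution matrix and a linear gap penalty; the default values are arbitrary choices for illustration.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap_penalty=2):
    """Globally align x and y; return (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    # F[i][j] holds the best score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap_penalty
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap_penalty

    def score(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(i, j),  # diagonal: align x[i-1] with y[j-1]
                F[i][j - 1] - gap_penalty,      # left: y[j-1] against a gap
                F[i - 1][j] - gap_penalty,      # above: x[i-1] against a gap
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - gap_penalty:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

Calling `needleman_wunsch("ACCGT", "AGCGT")` aligns the two strings without gaps (four matches and one mismatch, score 3). For a local alignment the same recurrence is used with an extra option of 0 in the maximum.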
BLAST with Small Sequences:
+ Set Expect value to >= 1000
+ Smaller sequences are more likely to occur by chance
+ Increasing Expect value results in more matches
+ Decrease word size (W)
+ BLASTN needs W >= 7
+ Query length >= 2W
+ The smaller W, the slower the search
+ Turn filter option OFF
+ Change the scoring matrix
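Why a smaller word size W yields more matches (and more work) can be seen with a toy version of the seeding step. This sketch only finds exact W-letter words shared by a query and one subject sequence; it is not the BLAST implementation, and the function names are hypothetical.

```python
from collections import defaultdict

def word_index(seq, w):
    """Map every w-letter word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index

def seed_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a w-letter word matches exactly."""
    index = word_index(subject, w)
    hits = []
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            hits.append((q, s))
    return hits
```

With query `"ACGTAC"` and subject `"TTACGTTT"`, W = 4 gives a single seed, while W = 3 gives three. Each seed is a candidate for extension, which is why lowering W increases sensitivity for short queries at the cost of speed.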
Understanding BLAST Since you will probably use BLAST more than any other sequence analysis software, lets learn a little bit more about what we just did, how it was achieved, and what else we can do with BLAST. If you return to the Swiss EMBnet BLAST page and hit the button labelled "Advanced BLAST", you will be presented with many more options. The optimal options for most searches are set by default, but you can easily change them for your BLAST query. Before we learn about a few of the most important options, we need to understand the basic concept behind how the search is done. Scoring: Finding the Best Alignments The BLAST programs attempt to align short regions of your sequence with regions of sequences in the database. The initial scanning phase identifies matching [query:database] fragments. A match is determined by the sum alignment score for a region (defined as a "word") of the query sequence. The alignment for each base in the word is scored: if a nucleotide in the query word exactly matches a nucleotide at the same position in the database word (e.g. A with A), then a positive score is awarded. If the match is good but not perfect (e.g. W in the query with A or T in the database) then a lower score is awarded. If two nucleotides do not match, a negative score is awared. The sum score is used to determine the degree of similarity. Sequences with a high score are referred to as high-scoring segment pairs (HSPs). The program tries to extend the best HSP (those with the highest score; the best matches) by extending the alignment in both directions. The alignment extension is continued until the sequence ends, or the alignment becomes non-biologically significant. Substitution matrices are used during both scanning and extension. The reported sequences are those with the overall highest scores (maximal-scoring segment pair, MSP). The WORDLENGTH Option The length of the initial word identified is specified by the value W (WORDLENGTH). 
BLAST only attempts to extend aligned fragments that are a perfect match for W continuous nucleotides. The default is 11 for blastn; by default, blastn will scan the database until it finds words that are 11 letters long that exactly match an 11 letter
word in the query. These will be extended. A WORDLENGTH of eleven is sufficient to exclude even moderately diverged homologs, and therefore also excludes almost all chance alignments. The Filter Option BLAST version 2.0 enables the application of a filter. The filter masks regions of the query sequence that have low compositional complexity (e.g. Alu sequences), as determined by computer programs (SEG or XNU); (References). Masking is achieved by replacing the sequence with a string of N's (NNNNNN). N is the IUB code for any DNA base. Only the query sequence is masked. The sequences in the database will not be masked. Filtering is a good idea for almost all sequences, and is turned on by default. PolyA tails and proline rich sequences, for example, can return artificially high scores and therefore misleading results. This is due to the large numbers of such sequences, dispersed throughout the genome, and therefore also throughout the database. The Matrix Option As discussed above, significance of a match is determined by the returned score. The score reflects the probability that this match would not have been found by chance. The method for calculating and optimising the score is the essential difference between the programs and matrices. Statistical matrices are used both to identify sequences in a database, and to predict the biological significance of the match. There are several to choose from, but you should usually accept the matrix recommended by the program you are using. To understand when to use different matrices, you should understand the matrices and how they work. If you wish, you can review the literature (References), or refer to Keith Robisons homepage. Briefly, there are two main types of substitution matrices that are frequently used by programs such as the BLAST family. Substitution matrices function to give a score to the alignment of each pair of residues. 
This isn't as simple as it may seem: remember that IUB ambiguity codes can be included in the sequence, and biological significance is the goal. For example, a mutation may result in lysine being incorporated into a protein instead of arginine. This may be a conservative mutation: both residues are basic, so the function of the protein is unlikely to change.
Common Substitution Matrix Families
PAM (Percent Accepted Mutation): PAM matrices are most sensitive for alignments of sequences with evolutionarily related homologs. The greater the number in the matrix name (e.g. PAM40, PAM120), the greater the expected evolutionary (mutational) distance. You should choose the appropriate matrix for an optimal search. If the mutational distance is unknown, you should run at least three searches using the PAM40, PAM120 and PAM250 matrices. You may choose to use PAM to identify conserved sequences or features therein, or to establish the evolution of a sequence.
BLOSUM (Blocks Substitution Matrix): BLOSUM matrices are most sensitive for local alignment of related sequences. The BLOSUM matrices are therefore ideal when trying to identify an unknown nucleotide sequence. BLOSUM62 is a good general matrix and is set by default for protein BLAST searches, i.e. BLASTP, BLASTX, TBLASTN and TBLASTX. It is optimised for general BLAST searches and is suitable for most situations; it will recognise some amino acid substitutions as conservative (e.g. Arg to Lys). If you are searching for evolutionarily related proteins, you should use PAM120 for generalised similarity searches. Take care! You cannot compare the alignment scores (see later) from one matrix directly against the alignment scores from another matrix! You can choose an alternative scoring matrix for BLASTP, BLASTX, TBLASTN or TBLASTX: PAM30, PAM70, BLOSUM80, BLOSUM45, or the default BLOSUM62. You cannot choose a matrix for BLASTN searches (instead, specify M and N, discussed below).
The EXPECT Option
You may, for example, wish to set an expected score threshold (EXPECT) for the search; it is set to 10 by default. This means that ten matches are expected to be found by chance. If the E-value of a match is greater than the EXPECT threshold, it is not reported.
Only if the E-value of a match is less than this level will the match be reported. In other words, a lower EXPECT threshold applies a more stringent search. This leads to fewer chance alignments being reported. You can enter fractional values if you wish; values are often suggested in a menu.
The Score Value Options
At the top of this page, we learnt that a [query:database] nucleotide pair is given a score depending on whether the nucleotides at that position are identical or not. The score awarded can be set by the user.
M Parameter: The score awarded when a pair of aligned residues match. Must be a positive integer.
N Parameter: The score awarded when a pair of aligned residues do not match. Must be a negative integer.
The ratio of M to N determines the degree of divergence (evolution) that is accepted. The default value for M is 5 and for N is -4. The resulting ratio of 1.25 (in absolute value) equates to around 47 nucleic acid point accepted mutations (PAMs) per 100 residues. PAMs are used as a predictor of the degree of evolution from an ancestor (in molecular terms). If you adjust the M and N values to give a higher ratio, more nucleic acid PAMs will be accepted by the algorithm, resulting in a more divergent search.
Fetching Sequences
Finally, you may have noticed that you don't need to enter a nucleotide sequence when performing a BLAST alignment. You can enter the ID number (EMBL entries) or accession number (GenBank) instead! This method allows you to fetch a sequence from the database, without performing an alignment. There may be several reasons you want to obtain the latest accurate version of a sequence. For example, you may be optimising a PCR and need to confirm the quality of your PCR reaction by sequencing your PCR product, then comparing your sequence against the published sequence in a database. There are yet more options, but they require a more complete understanding of the search algorithm to be applied successfully.
I suggest that you read the manual pages at a BLAST server site for more information.
Understanding FASTA
The FASTA algorithm and family of programs are similar to BLAST in that they both align a query sequence against all of the sequences in a database and return the most significant matches. Whereas BLAST relies on the sum match probability for each local alignment for the sequence, FASTA scores only exact matches in its initial scan. FASTA allows gapped searches to be made. Like BLAST, FASTA is heuristic, sacrificing some sensitivity for speed. FASTA comes in several flavours, and you should choose the most appropriate program when searching.
fasta3 - Align a DNA query sequence against a DNA sequence database, or a protein query sequence against a protein database.
tfasta3 - Align a protein query sequence against a DNA sequence database, translating the DNA sequences 'on-the-fly'.
fastx3 - Align a DNA query sequence against a protein sequence database, comparing the translated DNA sequence in three frames.
tfastx3 - Align a protein sequence to a DNA sequence database, allowing for forward and reverse frameshift mutations.
ssearch - Align a protein or DNA sequence to a sequence database using the Smith-Waterman algorithm.
FASTA Options
Sequences as short as 10 nucleotides in length can be queried using FASTA. The speed of the alignment is largely determined by the KTUP value, which is used to limit the word length. You may recall from the BLAST pages that a "word" is a short region of the query sequence that is compared against the database. In BLAST, words with the highest alignment score are selected for the extension phase. In FASTA, the word is not scored, but must be an exact match if it is to be processed further.
FASTA Matrices
When FASTA identifies an exact match, it uses a substitution matrix to determine the optimal alignment of the query sequence against the identified database sequence. The matrix is used to score the alignment of the sequence flanking the exact match. The main difference between the initial scanning and this extension phase is that the substitution matrix (here only used for extension, unlike BLAST) allows the IUB ambiguity codes to be included and scored. Finally, FASTA attempts to join contiguous regions of alignments. Most FASTA servers offer a choice of the BLOSUM or PAM family of substitution matrices (as for BLAST). For FASTA, the default and recommended general purpose matrix is BLOSUM50. BLOSUM62 is also a good choice if you find that BLOSUM50 is unsuitable. If you are searching for mutationally (evolutionarily) distant sequences, try PAM.
The Smith-Waterman Algorithm
You may have the option of using another algorithm that is more rigorous than those we have discussed up to now. The Smith-Waterman algorithm is employed by the ssearch3 program distributed with FASTA, and should be used if you need a highly refined search. It is much slower to use than either the BLAST or FASTA programs. You may find the TimeLogic server is fast for Smith-Waterman searches, but you will need to register to use the site. Registration was free at the time of writing, and came with benefits including access to other specialist databases. This guide will not use the TimeLogic server. You will usually be required to enter an email address for the results to be returned to you when they are ready. The Smith-Waterman algorithm is highly computationally intensive, and therefore very slow, but it is undoubtedly worth using if you find you need to.
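To see why Smith-Waterman is rigorous but slow, here is a minimal sketch of the local alignment recurrence: every cell of a (query length x subject length) table is filled, and scores are never allowed to drop below zero. It assumes a simple match/mismatch score and a linear gap penalty with arbitrary default values; it is not the ssearch implementation.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap_penalty=2):
    """Return (best_score, aligned_x, aligned_y) for the best local alignment."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def score(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,                              # start a fresh local alignment
                H[i - 1][j - 1] + score(i, j),  # extend along the diagonal
                H[i - 1][j] - gap_penalty,      # x[i-1] against a gap
                H[i][j - 1] - gap_penalty,      # y[j-1] against a gap
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score reaches zero.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap_penalty:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

For `smith_waterman("TTTACGTGG", "CCACGTAA")` the best local alignment is the shared ACGT segment. The full table costs time proportional to the product of the two lengths for every database sequence, which is the cost that the word-based heuristics in BLAST and FASTA avoid.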