
1. ~Appraise bioinformatics, explain different databases system, state 3 applications of bioinformatics.

Bioinformatics is an interdisciplinary field that combines biology, computer science, and mathematics to analyze
and interpret biological data. It involves the development and application of software tools and databases to store,
retrieve, and analyze large volumes of biological and genetic information.
Database Systems:
1. Genomic Databases: GenBank, Ensembl, NCBI Genome.
2. Protein Databases: UniProt, PDB, Pfam.
3. Sequence Databases: ENA, DDBJ (searched with tools such as BLAST and FASTA).
4. Gene Expression Databases: GEO, ArrayExpress.
5. Phylogenetic Databases: Tree of Life, NCBI Taxonomy.
Applications:
1. Genomic Analysis: Identifying genes, regulatory elements, and variations for genetic diseases
and evolutionary studies.
2. Proteomics and Structural Biology: Predicting protein structures, analyzing interactions, and aiding
drug discovery.
3. Clinical Genomics: Analyzing patient DNA for diagnosis, risk assessment, and personalized medicine.
In summary, bioinformatics utilizes diverse databases and applications to analyze biological data, offering insights into
genomics, proteomics, and personalized healthcare.

1. ~Establish the features and objectives of a biological database

Information contained in biological databases includes gene function, structure, localization (both cellular and
chromosomal), clinical effects of mutations, as well as similarities of biological sequences and structures.
4 objectives of biological databases:
- To make all relevant data available in one place.
- To store all relevant information easily.
- To make biological data available to scientists.
- To update existing information easily.

2. ~Explain the significance of biological databases, focusing on sequence data. Provide examples
of databases that store sequence data and discuss how researchers can utilize these
resources.

Biological databases, particularly those storing sequence data, are essential for researchers to access and analyze
genetic information. For example, the NCBI (National Center for Biotechnology Information) database contains vast
amounts of DNA and protein sequence data, as well as associated metadata.
Researchers can utilize these databases to perform sequence alignments, identify genes, study genetic variation,
and conduct phylogenetic analyses. These resources enable scientists to explore genetic relationships,
investigate functional elements, and contribute to advances in fields like genomics and proteomics.

1. ~Describe the steps for BLAST and classify its different types.
Main steps of BLAST are:
Step 1: The first step is to create a lookup table or list of words from the query sequence. This step is also called
seeding. BLAST takes the query sequence and breaks it into short segments called words.
Step 2: Search the database for exact matches to the list of words compiled in Step 1.
Step 3: BLAST then scores the similarity of the matching words using a given substitution matrix and extends
each match (hit) in both directions.
Step 4: Evaluate the significance of the extended hits from Step 3.

There are five types of BLAST that are differentiated based on the type of sequence (DNA or protein) of the query
and database sequences. They are: BLASTN, BLASTP, BLASTX, TBLASTN and TBLASTX.

3. ~Differentiate between primary, secondary and composite databases with examples of each.
Primary databases store and make data available to the public, acting as repositories. Example: GenBank, DDBJ.
Secondary databases make use of publicly available sequence data in primary databases to provide layers of
information on DNA or protein sequence data. Example: UniProt Knowledgebase.
Composite databases are meant for keeping records of specific datasets for specific purposes and
applications. Example: OMIM.

4. ~Infer Global alignment.
Global alignment is a method of comparing two sequences that aligns them over their entire length by
maximizing the overall similarity. It is most appropriate when the sequences being compared are of similar length.
Global alignment is based on the Needleman-Wunsch algorithm. In global alignment, the sequences to be aligned
are assumed to be generally similar over their entire length. Alignment is carried out from the beginning to the end
of both sequences to find the best possible alignment across the entire length of the sequences. The two
sequences are treated as potentially equivalent.

5. ~Describe the primary purpose of the NCBI database in the field of bioinformatics.
The NCBI database, or the National Center for Biotechnology Information database, serves as a central repository
for a wide range of biological and genetic information. Its primary purpose is to provide researchers, scientists,
and the public with access to data related to genetics, genomics, and other biological sciences. It hosts DNA and
protein sequences, genomic data, literature references, and tools for sequence analysis. Researchers use NCBI
to study genetic variations, conduct comparative genomics, and access valuable information for various biological
research purposes.

2. Estimate the characteristics and the applications of BLAST.
Several key features of BLAST make it a widely used tool in bioinformatics.
- BLAST is fast and efficient, making it possible to handle large databases of sequences.
- It is a flexible and versatile tool, as it can be used to search for similarities in both nucleotide and
protein sequences.
- It is highly sensitive, which allows the identification of even small similarities between sequences.
- It aims to identify regions of local similarity between the query sequence and the database sequence, rather
than attempting to align the entire sequences.
- It has a user-friendly interface that makes it easy to input query sequences and interpret the results.
Applications of BLAST are:
- BLAST can be used to identify unknown sequences by comparing them with known sequences in a database,
which helps in predicting the functions of proteins or genes.
- BLAST can also be used in phylogenetic analysis, which is important for understanding
the evolutionary relationships between different species.
- BLAST can also be used to identify functionally conserved domains within proteins, which is important
for predicting the functions of proteins.
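
The seeding and word-matching steps of BLAST (Steps 1 and 2 described earlier) can be sketched in a few lines of Python. This is only an illustrative simplification, not the actual BLAST implementation: the function names, the word size w and the example sequences are assumptions made for this sketch, only exact word matches are considered, and the scoring, extension and significance steps are left out.

```python
from collections import defaultdict

def build_word_table(query, w=3):
    """Step 1 (seeding): map every length-w word of the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table

def find_seeds(query, subject, w=3):
    """Step 2: scan the subject sequence for exact matches to the query words.

    Returns a list of (query_position, subject_position) pairs.
    """
    table = build_word_table(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds

# Hypothetical toy sequences, chosen only for illustration.
print(find_seeds("GATTACA", "TTGATTAG", w=4))  # [(0, 2), (1, 3)]
```

Each seed marks a position pair from which a real search tool would try to extend the match in both directions.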
3. ~Articulate the different types of phylogenetic tree.
- Rooted tree: Makes an inference about the most recent common ancestor of the leaves or branches of the tree.
- Unrooted tree: Illustrates the relationships among the leaves or branches and does not make any assumption
regarding the most recent common ancestor.
- Bifurcating tree: Phylogenetic trees in which each node gives rise to only two branches are referred to as
bifurcating trees. They can be further divided into rooted and unrooted bifurcating trees.
- Multifurcating tree: Multiple branches can be found on a single node in a multifurcating tree, as the name
suggests. These can likewise be divided into rooted and unrooted multifurcating trees.

6. ~Infer Local alignment and describe its application.
In local alignment, instead of attempting to align the entire length of the sequences, only the regions with the
highest density of matches are aligned. This is useful for identifying short, conserved regions in protein or
nucleotide sequences. Local alignment programs are based on the Smith-Waterman algorithm. Local alignment
does not assume that the two sequences in question are similar over their entire length; rather, it only finds local
regions with the highest level of similarity between the two sequences and aligns these regions without
regard for the alignment of the rest of the sequence. There are three primary methods for producing
local alignments: the dot matrix method, dynamic programming, and the word or k-tuple method.
Goal: See whether a substring in one sequence aligns well with a substring in the other.
Application:
1. Searching for local similarities in a large sequence (for example, a newly sequenced genome).
2. Searching for conserved domains or motifs.
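
The idea of local alignment can be illustrated with a minimal Smith-Waterman scoring sketch in Python. The scoring values (match = 2, mismatch = -1, gap = -2) and the function name are assumptions chosen for this example, and a simple linear gap penalty is used. The sketch returns only the best local score, without the traceback that would recover the aligned substrings.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The 0 lets an alignment restart anywhere; this is what makes it local.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best

# The shared substring ACGT (4 matches x 2) gives the best local score.
print(smith_waterman_score("GGGACGT", "ACGTCCC"))  # 8
```

Dropping the 0 in the max and initializing the first row and column with cumulative gap penalties would turn this into a global (Needleman-Wunsch style) score.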

7. ~Develop the importance of studying Bioinformatics

Understanding Biological Processes: Unraveling molecular and genomic data enhances knowledge of genetics,
evolution, and disease mechanisms, driving progress in medicine, agriculture, and environmental science.
Drug Discovery: Bioinformatics accelerates drug discovery by identifying targets, screening compounds, and
predicting drug effects, streamlining the development process and reducing costs.
Personalized Medicine: Analyzing individual genetic profiles facilitates personalized medicine, optimizing treatment
effectiveness and minimizing adverse effects.
Genomic Medicine: Accessible genome sequencing aids in identifying disease-causing mutations, understanding
genetic bases of diseases, and developing genetic tests, advancing genomic medicine.

4. ~Biology is important in computer science. Analyze your answer with suitable examples.
Biology and computer science are two seemingly distinct fields, but there are several areas where they intersect
and complement each other. Here are some justifications for the importance of biology in computer science,
along with suitable examples:
Bioinformatics: Uses computational techniques for genomics and proteomics. Example: Genome sequencing
employs algorithms for DNA sequence assembly, advancing medical research.
Computational Biology: Models and simulates biological processes, aiding drug discovery. Example: Computer
models predict drug molecule interactions with biological targets, expediting drug development.
Machine Learning and AI: Analyzes biological datasets, predicts protein structures, and identifies drug
candidates. Example: Deep learning predicts protein structures from amino acid sequences, aiding drug
design.
Biological Networks: Computer science analyzes complex networks like gene regulatory systems. Example:
Network analysis identifies key genes in disease pathways, offering therapeutic targets.
Phylogenetics: Uses computational algorithms to infer evolutionary relationships. Example: The Maximum
Likelihood method reconstructs species evolutionary trees, aiding understanding of biodiversity.
In summary, biology and computer science collaborate to manage and analyze biological data, advancing
understanding in medicine, biology, and technology development.

6. ~Illustrate K means.
The k-means algorithm is an iterative clustering technique that aims to partition a dataset into 'k' clusters. The key
steps include initializing cluster centroids, assigning data points to the nearest centroid, updating centroids based on
assigned points, and repeating these steps until convergence. The strengths of k-means include simplicity and
scalability, but it assumes that clusters are spherical and equally sized, and it is sensitive to initialization and
outliers.

2. ~Discuss Bayes theorem, Naïve Bayes classifier and neighbor joining algorithms.
Bayes' theorem is a fundamental concept in probability theory and statistics that has wide-ranging applications in
various fields, including bioinformatics. It is a mathematical formula that describes
how to update the probability of a hypothesis (an event or statement) based on new evidence.
Bayes' theorem is used to calculate conditional probabilities. It allows us to update our beliefs about the likelihood
of a hypothesis being true when we obtain new data or evidence.
Formula: The formal expression of Bayes' theorem is as follows:
P(H|E) = [P(E|H) * P(H)] / P(E)
Where:
P(H|E) is the posterior probability of hypothesis H given evidence E.
P(E|H) is the probability of observing evidence E given that hypothesis H is true.
P(H) is the prior probability of hypothesis H before considering evidence E.
P(E) is the probability of observing evidence E.

1. ~Explain the process and significance of a microarray experiment in gene expression analysis. Discuss how
microarray technology has transformed our understanding of gene regulation.
Microarray experiments involve hybridizing RNA samples to microarray chips containing thousands of gene
probes, allowing simultaneous measurement of gene expression levels. They have significantly advanced our
understanding of gene regulation by enabling the study of gene expression on a genome-wide scale. Researchers
can identify differentially expressed genes under various conditions, discover biomarkers, and uncover regulatory
networks. This technology has been instrumental in fields such as cancer research, where it has helped identify
genes associated with specific cancer subtypes and potential therapeutic targets.

A microarray is a laboratory tool used to detect the expression of thousands of genes at the same time. DNA
microarrays are microscope slides that are printed with thousands of tiny spots in defined positions, with each spot
containing a known DNA sequence or gene. Often, these slides are referred to as gene chips or DNA chips. The
DNA molecules attached to each slide act as probes to detect gene expression, which is also known as the
transcriptome or the set of messenger RNA (mRNA) transcripts expressed by a group of genes.

3. ~Analyze the central dogma procedure in brief.
The central dogma of molecular biology is a fundamental concept that describes the flow of genetic information in
biological systems. It consists of three main processes:
Replication: Copying DNA to produce identical molecules before cell division. DNA unwinds, and each strand
serves as a template for a complementary strand, resulting in two identical DNA molecules.
Transcription: DNA information is used to synthesize RNA (mRNA) in the nucleus (eukaryotes) or nucleoid
(prokaryotes). RNA polymerase reads DNA, producing a complementary mRNA strand.
Translation: mRNA guides protein synthesis on ribosomes in the cytoplasm. Ribosomes match mRNA codons with
specific amino acids, forming a polypeptide chain through tRNA. This chain folds into a functional protein.
In essence, the Central Dogma explains genetic information flow: DNA to RNA (transcription) to protein
(translation), forming a foundational framework for genetic expression in organisms.

1. Differentiate between cladogram and phylogenetic tree construction.
Cladograms and phylogenetic trees are functionally very similar, but they show different things. Cladograms do
not indicate time or the amount of difference between groups, whereas phylogenetic trees often indicate time
spans between branching points.

2. Discuss the structure of a Nucleotide in brief and mention its different types.
A nucleotide is the basic building block of nucleic acids (RNA and DNA). A nucleotide consists of a sugar molecule
(either ribose in RNA or deoxyribose in DNA) attached to a phosphate group and a nitrogen-containing base. The
bases used in DNA are adenine (A), cytosine (C), guanine (G) and thymine (T). In RNA, uracil (U) takes the place
of thymine.

3. ~Describe the main objective of the BIRCH (Balanced Iterative Reducing and Clustering using
Hierarchies) algorithm in data mining, and explain how it achieves this objective.
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster
large datasets by first generating a small and compact summary of the large dataset that retains as much
information as possible. This smaller summary is then clustered instead of clustering the larger dataset. The BIRCH
clustering algorithm consists of two stages:
- Building the CF Tree: BIRCH condenses large datasets into Clustering Feature (CF) entries, each represented
as (N, LS, SS), denoting cluster size, linear sum, and squared sum. CF entries can be composed
hierarchically, and the initial CF tree may be optionally condensed for efficiency.
- Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree. A CF tree is a tree
where each leaf node contains a sub-cluster. Every entry in a CF tree contains a pointer to a child node and a
CF entry made up of the sum of CF entries in the child nodes. Optionally, we can refine these clusters.
Due to this two-step process, BIRCH is also called Two Step Clustering.

4. ~Explain the basic idea behind the DIANA (Divisive Analysis) clustering algorithm in data mining, and describe
the key steps involved in its process.
DIANA stands for DIvisive ANAlysis clustering. It is the top-down form of hierarchical
clustering, in which all data points are initially assigned to a single cluster. The cluster is then split into the two
least similar clusters. This is done recursively until the resulting clusters are distinct from each other.
As an illustration: in step 1, all the points are assigned to a single cluster. In step 2, this cluster is divided into
two clusters based on the distances/density of the points. Lastly, in step 3, each of the two clusters is further
divided into two, again based on density and distances, to give the final four clusters. Since the points in each of
the four clusters are very similar to each other and very different from those in the other
clusters, they are not divided further. This is how we get DIANA
clusters, or top-down hierarchical clusters.
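
A single DIANA split can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the points are one-dimensional numbers, dissimilarity is the absolute difference, the input has at least two points, and the function names are made up for this example. The sketch follows the usual "splinter group" idea, in which the point that is on average farthest from the others starts a new group, and points keep moving over while they are closer to the new group than to the remaining one. Applying the split recursively to the resulting clusters would give the full top-down hierarchy.

```python
def mean_dist(x, others):
    """Average absolute distance from x to the points in others (0.0 if empty)."""
    return sum(abs(x - y) for y in others) / len(others) if others else 0.0

def diana_split(points):
    """Split one cluster of 1-D points into (remaining, splinter) groups."""
    remaining = list(points)
    splinter = []

    def rest(k):
        # All remaining points except the one at index k.
        return remaining[:k] + remaining[k + 1:]

    # Seed the splinter group with the point farthest, on average, from the rest.
    seed = max(range(len(remaining)), key=lambda k: mean_dist(remaining[k], rest(k)))
    splinter.append(remaining.pop(seed))

    moved = True
    while moved and len(remaining) > 1:
        moved = False
        # Positive gain: the point is closer to the splinter group than to its own group.
        gains = [mean_dist(remaining[k], rest(k)) - mean_dist(remaining[k], splinter)
                 for k in range(len(remaining))]
        k = max(range(len(remaining)), key=gains.__getitem__)
        if gains[k] > 0:
            splinter.append(remaining.pop(k))
            moved = True
    return remaining, splinter

# Hypothetical data with two obvious groups.
print(diana_split([1.0, 1.5, 2.0, 10.0, 10.5, 11.0]))
```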
6. Identify the key characteristics of dynamic programming algorithms in the context of sequence
alignment, and inspect why they are used.
Dynamic programming algorithms are characterized by their ability to solve complex problems by breaking them
down into smaller overlapping subproblems. They are used in sequence alignment to find the optimal alignment by
efficiently evaluating all possible alignments and selecting the best one.

7. In sequence alignment, analyze the primary purpose of heuristic alignment algorithms, and give an example
of one such algorithm?
Heuristic alignment algorithms aim to find reasonably good alignments quickly, often by making simplifying
assumptions. An example is the BLAST (Basic Local Alignment Search Tool) algorithm, which rapidly identifies
local sequence similarities.

5. Interpret what Bayes' Theorem states.
Bayes' theorem describes the probability of occurrence of an event related to any condition. It is also considered
for the case of conditional probability. Bayes' theorem is also known as the formula for the probability of “causes”.

P(H|E) = [P(E|H) * P(H)] / P(E)
Where:
P(H|E) is the posterior probability of hypothesis H given evidence E.
P(E|H) is the probability of observing evidence E given that hypothesis H is true.
P(H) is the prior probability of hypothesis H before considering evidence E.
P(E) is the probability of observing evidence E.
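
A small worked example may help make the formula P(H|E) = [P(E|H) * P(H)] / P(E) concrete. The numbers below are hypothetical and chosen only for illustration: H is "a sequence belongs to a given gene family" with prior 0.01, and E is "a motif is found in the sequence", observed with probability 0.9 when H is true and 0.05 when H is false. P(E) is obtained from the law of total probability.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Compute P(H|E) with Bayes' theorem.

    prior               = P(H)
    likelihood          = P(E|H)
    false_positive_rate = P(E|not H)
    """
    # P(E) = P(E|H) * P(H) + P(E|not H) * P(not H)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: the posterior is about 0.154.
print(round(posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05), 3))
```

Even with strong evidence, the posterior stays fairly low here because the prior probability of H is small.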
6. Identify the main idea behind the Naïve Bayes classifier, and infer how it handles
feature independence?
The Naïve Bayes classifier assumes that all features are conditionally independent, given the class label. It
calculates the probability of a data point belonging to a class by multiplying the class prior by the individual
conditional probabilities of each feature given that class.

8. Identify the use of Neighbor Joining Algorithm in the context of phylogenetic tree construction?
The Neighbor Joining Algorithm is used to construct phylogenetic trees from distance matrices. It iteratively joins
pairs of taxa or clusters based on their pairwise distances to build a hierarchical tree representing evolutionary
relationships.

9. Relate the fundamental concept behind the dynamic programming algorithms in pairwise sequence
alignment?
Dynamic programming algorithms, such as the Needleman-Wunsch and Smith-Waterman algorithms, use a
matrix-based approach to find the optimal alignment by considering all possible alignment paths and choosing the
one with the highest score.

10. Interpret some common challenges faced in the integration of biological data from various sources in
systems biology studies?
Challenges include data heterogeneity, differing data formats, data quality issues, and the need for robust data
integration methods to combine diverse biological datasets effectively.

1. Find the key difference between k-means and k-medoid clustering algorithms?
K-means uses the mean (centroid) of the data points in a cluster to represent it, whereas k-medoids uses an actual
data point (medoid) to represent the cluster.
K-Means uses the average of all points in a cluster (centroid), which may not be an actual data point.
K-Medoids selects an actual data point as the center (medoid) of the cluster.
Outlier Sensitivity: K-Means is sensitive to outliers. K-Medoids is more robust to outliers and noise.
Computational Complexity: K-Means is computationally less expensive compared to K-Medoids.
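
The difference between a centroid and a medoid, and why the medoid is less affected by outliers, can be seen in a short Python sketch. The data values and function names are hypothetical, and the points are one-dimensional with absolute difference as the distance, purely to keep the example small.

```python
def centroid(points):
    """Mean of the points; usually not one of the actual data points."""
    return sum(points) / len(points)

def medoid(points):
    """The actual data point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))

# Four nearby values plus one outlier (100.0).
data = [1.0, 2.0, 3.0, 4.0, 100.0]
print(centroid(data))  # 22.0, pulled far toward the outlier
print(medoid(data))    # 3.0, stays inside the main group
```

Finding the medoid compares every point with every other point, which also hints at why K-Medoids costs more to compute than K-Means.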
2. Discover what are some common challenges faced in the integration of biological data.
Some common challenges in the integration of biological data include:
Data Heterogeneity: Biological data comes from various sources, such as genomics, proteomics, and clinical
studies, each with different formats and standards, making it challenging to integrate.
Data Volume: The sheer volume of biological data generated is enormous, leading to issues related to storage,
processing, and analysis.
Data Quality: Ensuring the accuracy and consistency of data from diverse sources is a significant challenge, as
errors or inconsistencies can lead to misleading conclusions. These challenges make the integration of biological
data a complex task that requires specialized tools and techniques to address.

11. In the context of bioinformatics, identify the significance of microarray experiments.
Microarray experiments are used to measure the expression levels of thousands of genes simultaneously, helping
researchers study gene expression patterns under different conditions and gain insights into cellular processes.

12. Analyze some critical issues related to the design of a biological information system, especially when dealing
with large-scale datasets?
Issues include data storage and retrieval efficiency, data security, scalability to handle large datasets, user-friendly
interfaces for data access and analysis, and compliance with ethical and privacy standards.

13. Explain in a few sentences what a scoring model is in bioinformatics.
A scoring model in bioinformatics is a mathematical system used to assign scores or values to various biological
sequence alignments. It helps determine how well two sequences align with each other, with higher scores
indicating better alignment. Scoring models are essential in tasks such as sequence alignment, where they aid in
identifying similarities and differences between biological sequences.
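
A very simple scoring model can be written down directly in Python. The sketch below scores an already aligned pair of sequences column by column. The values (match = +1, mismatch = -1, gap = -2), the use of "-" as the gap symbol, and the function name are assumptions made for this illustration; real tools typically use substitution matrices and more elaborate gap costs.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned sequences of equal length; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap        # insertion or deletion
        elif x == y:
            score += match      # identical residues
        else:
            score += mismatch   # substitution
    return score

# Five matches, one mismatch and one gap: 5 - 1 - 2 = 2
print(alignment_score("GATTACA", "GA-TGCA"))  # 2
```

Under this model a higher total means a better alignment, so an alignment algorithm searches for the arrangement of gaps that maximizes it.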
3. Outline the primary advantage of the BIRCH (Balanced Iterative Reducing and Clustering using
Hierarchies) clustering algorithm?
BIRCH excels in managing large datasets, a critical advantage in the era of big data. Its hierarchical structure and
space-saving data summarization techniques enable it to handle extensive data efficiently. The algorithm iteratively
reduces the data size and performs clustering at multiple levels, making it suitable for applications dealing with vast
amounts of information. This efficiency is particularly valuable in scenarios where traditional clustering algorithms
may struggle due to memory or processing constraints.

14. Express why are positive scores typically assigned to matching nucleotides or amino acids in scoring
models for sequence alignment?
Positive scores are assigned to matching nucleotides or amino acids in scoring models because they reflect the
idea that identical or similar sequences in biological molecules are biologically significant. A positive score
encourages the alignment algorithm to prioritize regions of similarity, helping to identify homologous sequences
and functional similarities between biological molecules.
15. Briefly describe the role of gap penalties in scoring models for sequence alignment.
Gap penalties in scoring models for sequence alignment are used to account for the introduction of gaps (insertions
or deletions) in the alignment. Gap penalties are typically negative values. They discourage excessive gap creation,
ensuring that the alignment algorithm favors alignments with fewer gaps. This helps to maintain biologically
meaningful alignments by penalizing gaps that may not reflect true evolutionary relationships.

16. Explain the significance of GenBank, one of the primary databases hosted by NCBI.
GenBank is a critical component of the NCBI database. It is a repository for DNA and RNA sequences submitted by
researchers worldwide. The significance of GenBank lies in its role as a comprehensive and freely accessible
collection of genetic information. Researchers can deposit their sequences into GenBank, allowing others to access
and use this data for various research purposes, including gene discovery, phylogenetic studies, and
understanding the genetic basis of diseases. GenBank promotes data sharing, collaboration, and scientific
advancement in the field of molecular biology and genetics.

4. Interpret what grid-based clustering methods primarily rely on to divide the data space into cells
or regions?
Grid-based methods rely on a grid structure to divide the data space into cells or regions, making them suitable
for handling uniformly distributed data.

5. Analyze fundamental concept behind ISODATA (Iterative Self-Organizing Data Analysis Technique).
Grid-based clustering methods primarily rely on a structured grid to partition the data space into cells or regions. This
approach is particularly effective when dealing with uniformly distributed data. The grid structure provides a
systematic and organized way to divide the dataset into discrete units, making it easier to identify clusters and
analyze spatial relationships. This method simplifies the clustering process, especially in scenarios where the
distribution of data points is regular and can be aligned with the grid structure. The reliance on a grid facilitates
efficient data organization and retrieval, contributing to the effectiveness of grid-based clustering methods in handling
certain types of datasets.
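
The way a grid divides the data space into cells, as described for grid-based clustering, can be sketched in Python as follows. This is a minimal illustration only: the points are two-dimensional, the cells are squares of a fixed size, and the function names, the cell size and the density threshold are all assumptions made for this example. A full grid-based method would go on to merge neighbouring dense cells into clusters.

```python
from collections import Counter

def grid_cell_counts(points, cell_size):
    """Count how many 2-D points fall into each square grid cell."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

def dense_cells(points, cell_size, min_points):
    """Return the grid cells that contain at least min_points points."""
    counts = grid_cell_counts(points, cell_size)
    return {cell for cell, n in counts.items() if n >= min_points}

# Hypothetical points: two crowded cells and one isolated point.
points = [(0.1, 0.2), (0.4, 0.3), (0.2, 0.8), (5.1, 5.2), (5.3, 5.9), (9.0, 0.5)]
print(dense_cells(points, cell_size=1.0, min_points=2))  # cells (0, 0) and (5, 5)
```

Because each point is assigned to a cell with one division per coordinate, the work grows with the number of points and occupied cells rather than with pairwise distances.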
