NTCC Report For Plagiarism Check

Topic: Protein Sequence Analysis of Covid19 using Python.
Abstract
Protein sequence analysis of Covid-19 is an important area of research that has been widely
studied since the outbreak of the pandemic. In this report, I will provide an overview of how
Python can be used for protein sequence analysis of Covid-19. This type of analysis involves
examining the amino acid sequence of the proteins that make up the virus,
including the spike protein, nucleocapsid protein, and various enzymes. By analyzing these
sequences, researchers can gain insight into the structure, function, and evolution of the virus,
as well as identify potential targets for drug development and vaccine design. Moreover,
understanding the protein sequences of Covid-19 can help researchers design more effective
vaccines. For instance, the spike protein of the virus is a key target for vaccine development, as
it is responsible for binding to host cells and facilitating viral entry into the cell. By analyzing the
sequence of the spike protein, researchers can identify regions that are highly conserved and
likely to be effective targets for vaccine development. In addition to identifying targets for drug
development and vaccine design, protein sequence analysis can also provide insights into the
mechanisms by which the virus interacts with host cells. For example, researchers can use
protein sequence analysis to identify enzymes that are important for viral replication, and
potentially develop inhibitors that target these enzymes. Overall, protein sequence analysis of
Covid-19 is a crucial area of research that has the potential to yield significant breakthroughs in
the development of effective treatments and vaccines. By analyzing the amino acid sequences
of the various proteins that make up the virus, researchers can gain valuable insights into its
structure, function, and evolution, and identify potential targets for drug development and
vaccine design.
Overview of the Covid-19 pandemic
The SARS-CoV-2 virus has triggered a persistent worldwide public health emergency known as
the COVID-19 pandemic. The outbreak of the virus was first identified in Wuhan, China in
December 2019, and it quickly spread to other countries, eventually becoming a global
pandemic in early 2020. The virus spreads primarily through respiratory droplets when an
infected person talks, coughs, or sneezes, and can also be spread by touching a surface
contaminated with the virus and then touching one's mouth, nose, or eyes. The symptoms of
COVID-19 can range from mild to severe, and may include fever, cough, shortness of breath,
fatigue, body aches, loss of smell or taste, and gastrointestinal symptoms. In severe cases, the
virus can cause pneumonia, acute respiratory distress syndrome, and organ failure, and can be
fatal. In response to the pandemic, governments around the world implemented a variety of
measures to slow the spread of the virus, including lockdowns, social distancing, and mask
mandates. Vaccine development also became a top priority, with multiple vaccines being
developed and distributed globally. As of May 2023, the pandemic continues to have a
significant impact on global health and the global economy. While the rollout of vaccines has led
to a decrease in cases and deaths in some countries, the emergence of new variants of the virus
has led to ongoing concerns about the pandemic's long-term impact. The COVID-19 pandemic
has had far-reaching consequences beyond the immediate health impacts. It has disrupted
economies, led to the closure of schools and businesses, and caused widespread
unemployment and financial hardship. Governments, healthcare systems, and individuals
around the world have been forced to adapt to the challenges of the pandemic. The pandemic
has also spurred an unprecedented level of global collaboration and scientific research.
Scientists and researchers around the world have worked together to develop vaccines,
treatments, and new technologies to address the challenges of the pandemic. The speed of
vaccine development and distribution has been a notable achievement, with multiple safe and
effective vaccines being developed in record time. While the COVID-19 pandemic has been a
challenging and disruptive event, it has also brought about new opportunities for innovation
and collaboration. As the world continues to grapple with the ongoing effects of the pandemic,
it is likely that we will continue to see new approaches and solutions emerge.
Importance of Protein sequence analysis of Covid-19
Protein sequence analysis of Covid-19 is important for several reasons:

1. Understanding viral pathogenesis: COVID-19 is caused by the SARS-CoV-2 virus, which
uses the spike protein to bind to human ACE2 receptors, allowing it to enter cells.
Analyzing this protein sequence helps researchers understand infection mechanisms and
potential drug targets.
2. Developing diagnostics and vaccines: Protein sequence analysis is crucial for COVID-19
diagnostics and vaccines. It helps design effective tests for different virus strains by
identifying conserved protein regions. Furthermore, it informs vaccine development by
revealing protein structure and function for a strong immune response.
3. Tracking viral evolution: SARS-CoV-2 evolves and gives rise to new variants. Protein
sequence analysis tracks these changes, which can aid in the development of more
effective treatments and vaccines.
4. Drug discovery: Protein sequence analysis pinpoints drug targets by finding conserved
viral protein regions, helping design effective COVID-19 treatments.
In summary, protein sequence analysis of Covid-19 is important for understanding viral
pathogenesis, developing diagnostics and vaccines, tracking viral evolution, and aiding in drug
discovery. By gaining a better understanding of the structure, function, and evolution of Covid-
19 proteins, researchers can develop more effective strategies to combat the pandemic.
Data types and operators in Python
In Python, there are several data types and operators you can use.
Here are the commonly used data types:
1. Numbers:
• Integers: Whole numbers without a fractional part (e.g., 3, -15, 0).
• Floating-Point Numbers: Numbers with a fractional part (e.g., 3.14, -2.5, 0.0).
• Complex Numbers: Numbers with a real and imaginary part (e.g., 3+2j, -1-4j).
2. Strings:
• A series of characters surrounded by either single quotation marks ('') or double
quotation marks ("").
• Example: "Hello, World!"
3. Lists:
• An ordered collection of elements enclosed in square brackets ([]).
• Items can be of different data types and can be modified.
4. Dictionaries:
• A collection of key-value pairs enclosed in curly braces ({}) or created with the dict()
constructor. Since Python 3.7, dictionaries preserve insertion order.
• Keys must be unique and hashable (for example strings, numbers, or tuples of immutable items).
• Values may belong to various data types.
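The data types listed above, together with a few common operators, can be illustrated with a short sketch. The variable names are made up for this example, and the numbers are simply values that appear later in this report.

```python
# Illustrative values only; all names below are hypothetical.
genome_length = 29903            # int
gc_percent = 37.97               # float
z = 3 + 2j                       # complex
virus_name = "SARS-CoV-2"        # str

bases = ["A", "T", "C", "G"]     # list: ordered and mutable
bases.append("N")                # lists can be modified in place

counts = {"A": 8954, "T": 9594}  # dict: key-value pairs
counts["C"] = 5492               # add a new key

# A few common operators
total = counts["A"] + counts["T"]   # arithmetic: +
is_long = genome_length > 10_000    # comparison: >
has_n = "N" in bases                # membership: in
label = virus_name + " genome"      # string concatenation

print(type(genome_length).__name__, type(gc_percent).__name__, type(z).__name__)
print(total, is_long, has_n, label)
```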
How to create, import, and use modules and packages effectively in Python
Creating, importing, and using modules and packages effectively is an essential aspect of Python
development.
Here is a step-by-step guide:
A) Creating a Module:
1. Create a new Python file with a .py extension.
2. Write your code, including functions, classes, or variables, in the file.
3. Save the file with an appropriate name (e.g., mymodule.py).
B) Using a Module:
1. Import the module using the import statement: `import mymodule`.
2. Use the functions, classes, or variables defined in the module by prefixing them with the
module name, e.g., `mymodule.my_function()`.
C) Importing Specific Items from a Module:
1. Import specific functions or classes from a module using the from-import statement: `from
mymodule import my_function`.
2. Use the imported function directly without prefixing the module name, e.g., `my_function()`.
D) Creating a Package:
1. Create a new directory with a descriptive name for your package.
2. Inside the package directory, create a __init__.py file (can be empty) to make it a package.
3. Add Python module files (*.py) containing your code to the package directory.
E) Using a Package:
1. Import a module from the package using the dot notation: `import package.module`.
2. Use the functions, classes, or variables from the imported module by prefixing them with the
module name, e.g., `package.module.my_function()`.
F) Importing Specific Items from a Package:
1. Import specific functions or classes from a module in a package: `from package.module
import my_function`.
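Steps A) to F) can be tried end to end with the sketch below. It writes a small module and package into a temporary directory so that it runs on its own. `mymodule` and `my_function` follow the names used above, while `mypackage`, `seqtools`, and `count_bases` are hypothetical names chosen only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# A) and D): create mymodule.py and a package "mypackage" in a throwaway directory.
workdir = Path(tempfile.mkdtemp())
(workdir / "mymodule.py").write_text(
    "def my_function():\n    return 'hello from mymodule'\n"
)
pkg_dir = workdir / "mypackage"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")  # marks the directory as a package
(pkg_dir / "seqtools.py").write_text(
    "def count_bases(seq):\n    return len(seq)\n"
)

# Make the directory importable for this session only.
sys.path.insert(0, str(workdir))
importlib.invalidate_caches()

import mymodule                               # B) import the whole module
from mymodule import my_function              # C) import one name from it
import mypackage.seqtools                     # E) import a module from a package
from mypackage.seqtools import count_bases    # F) import one name from that module

print(mymodule.my_function())
print(my_function())
print(mypackage.seqtools.count_bases("ATTAAAGG"))
print(count_bases("ATTAAAGG"))
```

In a real project the files would be created by hand in the project directory, and the `sys.path` step would usually not be needed.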
Methods for protein sequence analysis
There are several methods for protein sequence analysis that are commonly used in research,
including:
1. Multiple sequence alignment (MSA): MSA compares several protein sequences to find
similarities and conservation, aiding in understanding protein evolution, function, and
potential drug/vaccine targets.
2. Phylogenetic analysis: Phylogenetic analysis creates a tree diagram to trace protein
evolution, uncovering shared origins, divergence, and insights into protein development.
3. Homology modeling: Homology modeling predicts a protein's 3D structure from its
amino acid sequence, using the known structure of a related protein as a template. It
aids in identifying functional regions and understanding molecular interactions.
4. Protein-protein interaction analysis: This method studies protein interactions to discover
drug targets, using techniques like co-immunoprecipitation and yeast two-hybrid assays.
5. Functional analysis: This method assesses a protein's biological role through techniques
like enzymatic assays, protein profiling, and gene knockout experiments, both in vivo
and in vitro.
Overall, these methods can be used to gain insights into the structure, function, and evolution
of proteins, and can help identify potential targets for drug development or vaccine design.
Some databases:
1. PDB (Protein Data Bank)
2. CATH (Class, Architecture, Topology, Homologous superfamily)
3. UniProt
4. Pfam
Overview
Protein sequence analysis examines a protein's amino acid sequence to identify features like
domains, structure, and binding sites. Covid-19's protein sequence is vital for its infectivity and
pathogenicity.
Python is a potent language commonly employed in bioinformatics and protein sequence
analysis, featuring libraries like Biopython, NumPy, and SciPy for this purpose.
For COVID-19 protein sequence analysis with Python, start by acquiring data from sources like
the NCBI database. Then, analyze it using Python libraries and tools.
A common method for protein sequence analysis involves using the Biopython library to
perform sequence alignments, identifying conserved regions or motifs.
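To make the idea of a conserved region concrete, here is a minimal sketch in plain Python. It does not use Biopython's alignment tools: it assumes the sequences are already aligned to the same length, and the peptide fragments and the function name `column_conservation` are made up for illustration.

```python
from collections import Counter

# Toy, already-aligned peptide fragments (hypothetical data, equal length).
aligned = [
    "MFVFLVLLPL",
    "MFVFLVLLPL",
    "MFIFLLFLTL",
    "MFVFFVLLPL",
]

def column_conservation(sequences):
    """Return, per alignment column, the share of sequences with the most common residue."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    scores = []
    for column in zip(*sequences):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(sequences))
    return scores

scores = column_conservation(aligned)
# Columns where every sequence carries the same residue are fully conserved.
conserved_positions = [i for i, s in enumerate(scores) if s == 1.0]
print(scores)
print(conserved_positions)
```

Real analyses would first compute the alignment with a dedicated tool and would usually use a more refined conservation score. The sketch only shows what is meant by a column being conserved.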
Protein sequence analysis also includes predicting protein structure and function, using
techniques like homology modeling and molecular dynamics simulations. Tools such as PyMOL and
Modeller can be scripted from Python, and GROMACS is a separate molecular dynamics package
that is often used alongside Python workflows.
Installation
To install Biopython, you can use the pip package manager by running the command "pip
install biopython" in the command line interface. Get the COVID-19 genome from NCBI:
MN908947 – the Covid-19 genome used here was sequenced from a bronchoalveolar lavage fluid
sample of a patient who worked at a market and was admitted to Wuhan Central Hospital on
December 26, 2019.
How to load the BioPython package in Python and how to check its attributes.
In [1]:
# Load the Pkg

import Bio
In [3]:
# Check the Attributes

dir(Bio)
Out[3]:
['BiopythonDeprecationWarning',
'BiopythonExperimentalWarning',
'BiopythonParserWarning',
'BiopythonWarning',
'MissingExternalDependencyError',
'MissingPythonDependencyError',
'__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__path__',
'__spec__',
'__version__',
'_parent_dir',
'os',
'warnings']
First, the "import Bio" statement loads the BioPython package, which provides a range of
functions and tools for bioinformatics and molecular biology analysis.
The next code block, "dir(Bio)", returns a list of the names defined in the top-level Bio
package. These include warning classes, version information, and the package's parent
directory, among others.
The output shows several Biopython warning classes, which are used to alert the user to
deprecated or experimental features and other potential issues with their code. The output
also includes `__version__`, which holds the Biopython version, and the standard `os` and
`warnings` modules that the package imports.
In summary, this code snippet demonstrates how to load and explore the attributes of the
BioPython package in Python, which is an essential step in using the package for bioinformatics
and molecular biology analysis.
Let’s start:
In [1]:
from Bio import Entrez, SeqIO

Entrez.email = ""
handle = Entrez.efetch(db="nucleotide", id="MN908947", rettype="gb", retmode="text")
recs = list(SeqIO.parse(handle, 'gb'))
handle.close()
In [2]:
recs
Out[2]:
[SeqRecord(seq=Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT..
.AAA'), id='MN908947.3', name='MN908947', description='Severe acute respiratory syndrome
coronavirus 2 isolate Wuhan-Hu-1, complete genome', dbxrefs=[])]
In [3]:
covid_dna = recs[0].seq
In [4]:
covid_dna
Out[4]:
Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA')
In [5]:
print(f'The genome of Covid-19 consists of {len(covid_dna)} nucleotides.')

The genome of Covid-19 consists of 29903 nucleotides.
In [6]:
# molecular weight
from Bio.SeqUtils import molecular_weight
molecular_weight(covid_dna)
Out[6]:
9241219.214400413
In [7]:
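# GC content - higher GC content implies a more stable molecule, because each G-C pair
# forms three hydrogen bonds
# Note: GC() returns a percentage; newer Biopython releases provide gc_fraction() instead,
# which returns a fraction between 0 and 1
from Bio.SeqUtils import GC
GC(covid_dna)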
Out[7]:
37.97277865097148
This code is using Biopython to retrieve and analyze the DNA sequence of the SARS-CoV-2
genome.
The first line, from Bio import Entrez, SeqIO, imports two modules from Biopython, Entrez and
SeqIO, which will be used to retrieve the sequence record from the NCBI database and to parse
it, respectively.
The next line sets the email address to be used for accessing the NCBI database, as required by
NCBI.
The Entrez.efetch function is used to retrieve the sequence record for the specified ID
MN908947 from the nucleotide database. The rettype="gb" and retmode="text" parameters
indicate that the record should be returned in GenBank format.
The SeqIO.parse function is then used to parse the record. It returns an iterator of SeqRecord
objects, which is converted to a list using the list function, and the handle is closed
using the handle.close method.
The recs variable contains the parsed sequence record as a list of SeqRecord objects, and
recs[0].seq retrieves the DNA sequence from the first record in the list, which corresponds to
the SARS-CoV-2 genome.
The len function is used to determine the length of the DNA sequence, and the
molecular_weight function from Bio.SeqUtils is used to calculate the molecular weight of the
DNA sequence.
Finally, the GC function from Bio.SeqUtils is used to calculate the GC content of the DNA
sequence.
Distribution of nucleotides in the COVID-19 genome
In [8]:
count_nucleotides = {
'A': covid_dna.count('A'),
'T': covid_dna.count('T'),
'C': covid_dna.count('C'),
'G': covid_dna.count('G')
}
In [9]:
count_nucleotides
Out[9]:
{'A': 8954, 'T': 9594, 'C': 5492, 'G': 5863}
In [10]:
import matplotlib.pyplot as plt

width = 0.5
plt.bar(count_nucleotides.keys(), count_nucleotides.values(), width, color=['b', 'r', 'm', 'c'])
plt.xlabel('Nucleotide')
plt.ylabel('Frequency')
plt.title('Nucleotide Frequency')
Out[10]:
Text(0.5, 1.0, 'Nucleotide Frequency')
Nucleotides A and T are more frequent than C and G in the genome of the virus that causes
Covid-19, which is consistent with the GC content of about 38% computed above. This
information is important for understanding the genetic makeup of the virus, which can be
useful for developing vaccines and treatments.
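As a quick cross-check, the counts in Out[9] can be turned into percentages by hand. The sketch below uses only the four numbers reported above; the variable names `gc_percent` and `at_percent` are chosen for illustration.

```python
# Counts reported in Out[9] above.
count_nucleotides = {"A": 8954, "T": 9594, "C": 5492, "G": 5863}

total = sum(count_nucleotides.values())
gc_percent = 100 * (count_nucleotides["G"] + count_nucleotides["C"]) / total
at_percent = 100 * (count_nucleotides["A"] + count_nucleotides["T"]) / total

print(total)                  # 29903, the genome length reported earlier
print(round(gc_percent, 2))   # 37.97, matching the GC() result
print(round(at_percent, 2))   # 62.03
```

The four counts add up to the full genome length, so the record contains no ambiguous bases, and the GC share agrees with the value returned by Biopython.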
So the question is:

How do we extract information from this long string?
Gene expression is the vital process of using genetic information to create functional gene
products, usually proteins.
Transcription
Gene expression begins with transcription. To create an RNA molecule, the DNA sequence of a
gene must be copied.
In [11]:
covid_mrna = covid_dna.transcribe()
covid_mrna
Out[11]:
Seq('AUUAAAGGUUUAUACCUUCCCAGGUAACAAACCAACCAACUUUCGAUCUCUUGU...AAA')
In cellular organisms, DNA stores the genetic information, and in eukaryotes it is confined to
the cell nucleus. To enable protein synthesis, RNA carries the genetic information from the DNA
to the ribosomes. RNA is formed through transcription, which occurs in the nucleus (eukaryotes)
or the cytoplasm (prokaryotes). In SARS-CoV-2, RNA itself serves as the genetic material; the
GenBank record simply stores the genome using DNA letters. The transcribe method in the code
converts this DNA-letter sequence into the corresponding RNA sequence.
The `transcribe` method in Biopython converts DNA to RNA by replacing thymine (T) with uracil
(U). It treats the input as the coding strand, so the result has the same order of bases and is
not the complement of the input. In the code example, the COVID-19 sequence is transcribed into
mRNA, which is then displayed.
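The following plain-Python sketch mirrors what Biopython's `transcribe` does on a coding strand and contrasts it with a reverse complement, which is a different operation. Both helper functions are written here for illustration and are not part of Biopython. The fragment is the start of the sequence shown in Out[4].

```python
def transcribe(coding_dna):
    """Coding-strand DNA -> mRNA: same letters, with T replaced by U."""
    return coding_dna.upper().replace("T", "U")

def reverse_complement(dna):
    """Reverse complement of a DNA strand, shown only for contrast."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(dna.upper()))

fragment = "ATTAAAGGTTTATACC"  # first bases of the record shown in Out[4]
print(transcribe(fragment))          # AUUAAAGGUUUAUACC
print(reverse_complement(fragment))  # GGTATAAACCTTTAAT
```

The first result matches the beginning of Out[11], which confirms that the order of bases is unchanged by transcription in Biopython.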
Translation
Translation is the process by which information transmitted from the DNA as messenger RNA is
converted into a string of amino acids.
In [12]:
covid_aa = covid_mrna.translate()
covid_aa
/Users/lanacaldarevic/opt/miniconda3/envs/12daysofbiopython/lib/python3.8/site-packages/
Bio/Seq.py:2979: BiopythonWarning: Partial codon, len(sequence) not a multiple of three.
Explicitly trim the sequence or add trailing N before translation. This may become an error in
future.
warnings.warn(
Out[12]:
Seq('IKGLYLPR*QTNQLSISCRSVL*TNFKICVAVTRLHA*CTHAV*LITNYCR*QD...KKK')
In [13]:
#most common amino acids

from collections import Counter
common_amino = Counter(covid_aa)
common_amino.most_common(10)
Out[13]:
[('L', 886),
('S', 810),
('*', 774),
('T', 679),
('C', 635),
('F', 593),
('R', 558),
('V', 548),
('Y', 505),
('N', 472)]
In [14]:
del common_amino['*']
width = 0.5
plt.bar(common_amino.keys(), common_amino.values(), width, color=['b', 'r', 'm', 'c'])
plt.xlabel('Amino Acid')
plt.ylabel('Frequency')
plt.title('Protein Sequence Frequency')
Out[14]:
Text(0.5, 1.0, 'Protein Sequence Frequency')
Let’s first understand this section of code.

This block of code performs the translation of the COVID-19 DNA sequence to its corresponding
amino acid sequence using the `.translate()` method of the `covid_mrna` sequence object. The
resulting amino acid sequence is stored in the `covid_aa` variable as a `Seq` object.
As the warning message indicates, the length of the sequence is not a multiple of three, which
means that there may be partial codons at the end of the sequence. These partial codons may
affect the translation result and could produce errors in future versions of Biopython.
To analyze the resulting amino acid sequence, the code uses the `Counter` class from the
`collections` module to count the frequency of each amino acid in the sequence and store the
results in `common_amino`, a dictionary-like object. The `most_common` method of the `Counter`
object is used to extract the 10 most common amino acids and their frequencies.
Since stop codons are not useful for protein analysis, the `*` symbol representing the stop
codon is removed from the dictionary using the `del` statement.
Finally, a bar plot is created using the `matplotlib` library to visualize the frequency of each
amino acid in the protein sequence. The plot shows the x-axis as the amino acid one-letter code
and the y-axis as the frequency count. The plot is titled "Protein Sequence Frequency".
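To show what translation does codon by codon, and why a sequence whose length is not a multiple of three triggers the warning, here is a minimal plain-Python sketch. It builds the standard genetic code table by hand and translates in reading frame 0 only. The function `translate` below is an illustration and not Biopython's implementation; for instance, it does not handle ambiguous bases such as N.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop codon.
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate DNA in reading frame 0; trailing bases that do not fill a codon are dropped."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

prefix = "ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT"  # start of Out[4]
print(translate(prefix))  # IKGLYLPR*QTNQLSISC
print(29903 % 3)          # 2: two bases are left over, which is why Biopython warns
```

The translated prefix matches the start of Out[12]. Because only one reading frame is used, the stop codons found this way do not necessarily mark the ends of the virus's real proteins.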
Let’s move further with coding

In [15]:
print(f"Covid-19's genome has {sum(common_amino.values())} amino acids")

Covid-19's genome has 9193 amino acids
The split() function splits the sequence at every stop codon and keeps the amino acid chains
separated.
In [16]:
proteins = covid_aa.split('*')
In [17]:
proteins[:5]
Out[17]:
[Seq('IKGLYLPR'),
Seq('QTNQLSISCRSVL'),
Seq('TNFKICVAVTRLHA'),
Seq('CTHAV'),
Seq('LITNYCR')]
In [18]:
print(f'We have {len(proteins)} amino acid chains in the covid-19 genome')

We have 775 amino acid chains in the covid-19 genome
It is worth mentioning that not all amino acid sequences are proteins. Functional proteins
generally consist of more than 20 amino acids. Shorter amino acid sequences are oligopeptides
and have other functions. In this context, our emphasis will be on amino acid chains of at
least 20 residues, which is what the code below is meant to keep.
In [19]:
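# Caution: removing items from a list while iterating over it skips the element that
# follows each removal, so some chains shorter than 20 may remain in the list.
for protein in proteins:
    if len(protein) < 20:
        proteins.remove(protein)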
In [20]:
print(f'We have {len(proteins)} proteins with more than 20 amino acids in the covid-19 genome')
We have 409 proteins with more than 20 amino acids in the covid-19 genome
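One caveat about the filtering loop in In [19]: it removes items from `proteins` while iterating over the same list, and in Python this skips the element that follows each removal. The toy example below, with made-up chains, shows the effect and a safer alternative using a list comprehension. The count of 409 reported above comes from the original loop and has not been recomputed here, so it may include some chains shorter than 20.

```python
# Toy chains (hypothetical data) with lengths 5, 8, 30, 3, 4, 25.
chains = ["A" * 5, "C" * 8, "D" * 30, "E" * 3, "F" * 4, "G" * 25]

# In-place removal while iterating: the loop skips the item after each removal.
looped = list(chains)
for chain in looped:
    if len(chain) < 20:
        looped.remove(chain)

# Building a new list checks every chain exactly once.
filtered = [chain for chain in chains if len(chain) >= 20]

print([len(c) for c in looped])    # [8, 30, 4, 25]: two short chains survive
print([len(c) for c in filtered])  # [30, 25]
```

Applied to the notebook, the comprehension would read `proteins = [p for p in proteins if len(p) >= 20]`, and the longest chain could then be picked with `max(proteins, key=len)`.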
In [25]:
top_5_proteins = sorted(proteins, key = len)
In [26]:
top_5_proteins[-1]
Out[26]:
Seq('CTIVFKRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQ...VNN')
In [27]:
len(top_5_proteins[-1])
Out[27]:
2701
We usually save this protein to file for further analysis

In [24]:
with open("protein_seq.fasta", "w") as file:
    file.write(f">covid protein\n{top_5_proteins[-1]}")
This code is processing the genetic sequence of the COVID-19 virus to identify the proteins that
are produced from it and then saving the longest protein to a file for further analysis.
The first line translates the mRNA sequence of the virus into the corresponding amino acid
sequence using the `translate` function.
The `split` function is then used to separate the amino acid sequence into a list of proteins. The
function splits the sequence at every stop codon, which is represented by the `*` character, and
keeps the amino acid chains separated.
The code then filters out any amino acid chains that are less than 20 amino acids long, as these
shorter sequences are usually not functional proteins. This leaves only the proteins with 20 or
more amino acids, which are considered to be functional proteins.
The total number of proteins and the number of functional proteins are printed using the `len()`
function.
The code then identifies the longest protein sequence by sorting the list of proteins by length
and selecting the last element.
Finally, the longest protein sequence is saved to a file named `protein_seq.fasta` using a `with`
statement that opens the file in write mode and writes the sequence in the FASTA format,
which starts with a header line that begins with `>` followed by a description of the sequence,
and then the sequence itself on the next line.
Going through the cells one by one: the first `print` statement reports the number of amino
acids obtained from the Covid-19 genome. It uses the `sum` function to add up the values of
`common_amino`, which was created earlier in the code to count the frequency of each amino
acid in the protein sequence.
The second block of code splits the `covid_aa` sequence into individual protein sequences by
using the `split` function. It splits the sequence at each stop codon ('*') and saves each protein
sequence as a separate `Seq` object in the `proteins` list.
The third line of code prints the number of proteins in the `proteins` list.
The fourth block of code removes any protein sequences that are shorter than 20 amino acids
long, as these sequences are not likely to be functional proteins. It then prints the number of
remaining protein sequences in the `proteins` list.
The fifth block of code sorts the `proteins` list by the length of each sequence and saves the
sorted list as `top_5_proteins`, so the longest sequence is its last item, `top_5_proteins[-1]`.
The sixth line of code gets the length of the longest protein sequence and displays it.
The last block of code saves the longest sequence, `top_5_proteins[-1]`, to a file called
"protein_seq.fasta", with the header line ">covid protein". This file can be used for further
analysis of the protein sequence.
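The FASTA layout described above can be sketched without Biopython as follows. The helpers `write_fasta` and `read_fasta` are hypothetical names written for this illustration, the demo sequence is made up, and wrapping the sequence at 60 characters per line is a common convention rather than a requirement.

```python
import tempfile
from pathlib import Path

def write_fasta(path, header, sequence, width=60):
    """Write one record: a '>' header line, then the sequence wrapped to `width` columns."""
    sequence = str(sequence)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    Path(path).write_text(f">{header}\n" + "\n".join(lines) + "\n")

def read_fasta(path):
    """Read back a single-record FASTA file as (header, sequence)."""
    lines = Path(path).read_text().splitlines()
    return lines[0][1:], "".join(lines[1:])

# Hypothetical 150-residue sequence, used only to show the wrapping.
demo_sequence = "IKGLYLPRQT" * 15
demo_path = Path(tempfile.mkdtemp()) / "protein_seq.fasta"
write_fasta(demo_path, "covid protein", demo_sequence)

header, sequence = read_fasta(demo_path)
print(header, len(sequence))
```

For files with several records, Biopython's SeqIO module is the usual choice; the sketch is only meant to show what the file contains.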
Summary of Findings
• Sequence length: 29,903 base pairs
• GC content: 37.97%
• Has a high amount of Leucine (L) and Serine (S)
• 409 proteins with more than 20 amino acids
• The largest protein has a length of 2,701 amino acids.
The analysis of the SARS-CoV-2 genome, with a length of 29,903 base pairs and a 37.97% GC
content, showed that Leucine (L) and Serine (S) are the most frequent amino acids in the
translated sequence. After filtering, 409 amino acid chains remained, the longest of which has
2,701 amino acids; such long chains are candidates for coding functional proteins. These
results come from translating the genome in a single reading frame, so they are a starting
point rather than a complete picture of the virus's proteins. Data of this kind can support
vaccine and therapy development and provide insights into virus mechanisms and interactions
with human cells, which is valuable for COVID-19 research.
Challenges in protein sequence analysis of Covid-19
Protein sequence analysis of Covid-19 can be challenging due to several factors, including:
1. Lack of complete sequence data: Although the SARS-CoV-2 genome is known, gaps in
our understanding of the virus's proteome make it difficult to identify and characterize
all its proteins.
2. Rapidly evolving virus: SARS-CoV-2 mutates quickly, leading to the emergence of new
variants, which complicates treatment and vaccine development.
3. Large amount of data: Analyzing Covid-19 protein sequences demands extensive data
processing, often time-consuming and computationally intense. Specialized
bioinformatics tools and expertise may not be accessible everywhere for this purpose.
4. Limited experimental data: Bioinformatics tools offer insights into protein structure and
function but are constrained by data availability, potentially hindering virus proteome
understanding and treatment development.
Despite these challenges, protein sequence analysis is crucial for understanding SARS-CoV-2's
structure, function, and evolution. With new bioinformatics tools and experimental data,
researchers can identify drug and vaccine targets.
Applications of protein sequence analysis
Protein sequence analysis has many applications, including:
1. Drug design: Protein sequence analysis helps identify vital, conserved viral protein
regions as potential drug targets, enabling the development of broad-spectrum antiviral
drugs.
2. Vaccine design: Protein sequence analysis helps find conserved viral protein regions for
broad-spectrum vaccine development.
3. Disease diagnosis: Protein sequence analysis helps identify disease-associated protein
mutations, enabling the development of diagnostic tests.
4. Evolutionary analysis: Protein sequence analysis can reveal the evolutionary relationships
between organisms, including viruses. By comparing protein sequences, researchers can learn
how organisms have changed over time.
5. Structural biology: Protein sequence analysis can predict protein structure, which is key to
understanding protein function.
Protein sequence analysis is a powerful tool that researchers use to understand protein structure,
function, and evolution. This knowledge can inform the development of new treatments and vaccines.
Future directions
Protein sequence analysis of Covid-19 is a rapidly evolving field of research, and there are
several future directions that hold promise for advancing our understanding of the virus and its
proteins.
Some potential directions include:
1. Analysis of new variants: New SARS-CoV-2 variants require ongoing analysis of viral proteins
to understand how they differ from the original strain and predict how mutations impact
structure and function.
2. Integration of experimental data: While bioinformatics tools are powerful, they need
experimental data to work. As we get more experimental data, we can combine it with
bioinformatics analysis to better understand virus proteins.
3. Multi-omics analysis: Protein sequence analysis combined with other omics data can reveal
the virus's biology and identify new drug and vaccine targets.
4. Machine learning: Machine learning can help us analyze and interpret the growing amount of
protein sequence data more effectively. It can predict protein structures and identify new
patterns that would be difficult to find with traditional methods.
5. Integration with clinical data: Clinical data and protein sequence analysis can be integrated to
understand virus-human interactions. This could lead to new treatment targets or personalized
medicine approaches.
Overall, protein sequence analysis of Covid-19 is a promising field with many possible future directions,
such as developing new treatments and vaccines. Researchers can use new technologies and integrate
data from multiple sources to make progress.
Conclusion
In conclusion, this report analyzed the genome of Covid-19 and identified key characteristics of
the virus's genetic makeup. The analysis revealed that the sequence length of the genome is
29,903 base pairs, and it has a GC content of 37.97%. Additionally, the translated sequence
has a high amount of Leucine (L) and Serine (S), with 409 proteins identified that have more
than 20 amino acids.
The largest protein identified in the analysis is 2,701 amino acids long. The report also highlights
the importance of identifying and analyzing proteins with more than 20 amino acids since only
these sequences are thought to code for functional proteins.
The findings of this report can contribute to a better understanding of Covid-19 and provide a
basis for further research on the virus. Moreover, the analysis of the virus's genome may aid in
the development of vaccines and treatments for Covid-19, which remains a significant public
health threat globally. Overall, this report's insights can help researchers and healthcare
professionals in understanding the genetic makeup of Covid-19, its characteristics, and its
impact on human health.
