COVID-19 DNA sequence data analysis using Python.

Major Modules Used:

Biopython
Squiggle
Pandas

Importing Modules:

import warnings

import pandas as pd
from Bio import SeqIO
from Bio.Data import CodonTable
from Bio.SeqUtils import ProtParam

We will use Bio.SeqIO from Biopython to parse the DNA sequence data (FASTA). It provides a simple, uniform interface for reading and writing assorted sequence file formats.
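To make the FASTA layout concrete, here is a minimal plain-Python sketch of the idea: a header line starting with ">" followed by one or more lines of sequence. The function name parse_fasta and the toy record are hypothetical, introduced only for illustration. This is not how Bio.SeqIO is implemented, and it handles far fewer edge cases.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy input (hypothetical, not real genome data)
example = """>toy_record
ATTAAAGGTT
ACGTACGTAC
"""

records = list(parse_fasta(example))
for header, sequence in records:
    print(header, len(sequence), "nucleotides")
```

In practice SeqIO.parse is the better choice, since it returns full record objects and supports many formats beyond FASTA.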

for sequence in SeqIO.parse(r'Covid.fna', "fasta"):
    print(sequence.seq)
    print(len(sequence), 'nucleotides')

DNAsequence = SeqIO.read(r'Covid.fna', "fasta")
print(DNAsequence)

Since the input sequence is FASTA (DNA) and the coronavirus is an RNA virus, we need to:
Transcribe DNA to RNA (ATTAAAGGTT… => AUUAAAGGUU…)
Translate RNA to an amino acid sequence (AUUAAAGGUU… => IKGLYLPR*Q…)

In this case the .fna file starts with ATTAAAGGTT. Calling transcribe() replaces each T (thymine) with U (uracil), so the resulting RNA sequence starts with AUUAAAGGUU. The transcribe() method converts the DNA to mRNA.

DNA = DNAsequence.seq
mRNA = DNA.transcribe()
print(mRNA)
print('Size : ', len(mRNA))

The only difference between the DNA and the mRNA is that the base T (thymine) is replaced with U (uracil).
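Because the change is a single-letter substitution, it can be sketched in plain Python. The helper below is a hypothetical stand-in written for illustration, not Biopython's implementation, and it assumes an uppercase DNA string.

```python
def transcribe(dna):
    """Return the mRNA string for an uppercase DNA coding-strand string."""
    # Every thymine (T) becomes uracil (U); A, C and G are unchanged
    return dna.replace("T", "U")


dna_start = "ATTAAAGGTT"  # first ten bases quoted in the text
mrna_start = transcribe(dna_start)
print(mrna_start)  # AUUAAAGGUU
```

The length of the sequence does not change, which is why the mRNA printed above has the same size as the DNA.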
Next, we translate the mRNA sequence into an amino acid sequence using the translate() method. We get something like IKGLYLPR*Q, where * marks a so-called STOP codon, which effectively acts as a separator between proteins.
Amino_Acid = mRNA.translate(table=1, cds=False)
print('Amino Acid', Amino_Acid)
print("Length of Protein:", len(Amino_Acid))
print("Length of Original mRNA:", len(mRNA))
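To show what translation does conceptually, the sketch below reads an mRNA string three bases at a time and looks each codon up in the standard genetic code. The names BASES, AMINO_ACIDS, CODON_TABLE and the plain-Python translate function are assumptions made for this illustration. Biopython's translate() supports many more options, such as alternative tables and ambiguous bases.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an uppercase mRNA string; trailing bases that do not
    fill a whole codon are ignored."""
    usable = len(mrna) - len(mrna) % 3
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))


print(translate("AUUAAAGGUU"))  # IKG (the final U is left over)
```

The protein is roughly one third the length of the mRNA because each amino acid is encoded by three nucleotides.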

The standard genetic code is traditionally represented as an RNA codon table because, when proteins are made in a cell by ribosomes, it is mRNA that directs protein synthesis. The mRNA sequence is determined by the sequence of genomic DNA. Here are some features of codons:
Most codons specify an amino acid
Three “stop” codons mark the end of a protein
One “start” codon, AUG, marks the beginning of a protein and also encodes the amino acid methionine.
Figure: A series of codons in part of a messenger RNA (mRNA) molecule. Each codon consists of three nucleotides, usually corresponding to a single amino acid. The nucleotides are abbreviated with the letters A, U, G, and C. This is mRNA, which uses U (uracil). DNA uses T (thymine) instead. This mRNA molecule will instruct a ribosome to synthesize a protein according to this code.

print(CodonTable.unambiguous_rna_by_name['Standard'])
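The codon features listed earlier can be checked with a short plain-Python sketch. It rebuilds the standard table by hand rather than reading it from Biopython, and the variable names are chosen for this illustration only.

```python
from collections import Counter
from itertools import product

# Standard genetic code in U, C, A, G order for each codon position
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in codon_table.items() if aa == "*")
coding_codons = [c for c, aa in codon_table.items() if aa != "*"]
codons_per_amino_acid = Counter(aa for aa in codon_table.values() if aa != "*")

print("Stop codons:", stop_codons)
print("Codons that specify an amino acid:", len(coding_codons))
print("AUG encodes:", codon_table["AUG"])
print("Codons for leucine (L):", codons_per_amino_acid["L"])
```

The last line illustrates that the code is redundant: several codons can map to the same amino acid, while methionine (M) has only AUG.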
Now we extract the proteins (chains of amino acids) by splitting the sequence at each stop codon, marked by * (asterisk). Then we remove any chain shorter than 20 amino acids, since that is about the length of the smallest known functional protein.

# Split the translated sequence at every stop codon (*)
Proteins = Amino_Acid.split('*')
# Build the frame from plain strings so each row holds one whole chain
df = pd.DataFrame([str(p) for p in Proteins])
df.describe()
print('Total proteins:', len(df))

def conv(item):
    return len(item)

def to_str(item):
    return str(item)

df['sequence_str'] = df[0].apply(to_str)
df['length'] = df[0].apply(conv)
df.rename(columns={0: "sequence"}, inplace=True)
df.head()

# Keep only chains of at least 20 amino acids
functional_proteins = df.loc[df['length'] >= 20]
print('Total functional proteins:', len(functional_proteins))
print(functional_proteins.describe())
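The split-and-filter step can also be sketched without pandas, which makes the logic easier to see. The function name split_proteins, the min_length parameter and the toy sequence are hypothetical and used only for illustration.

```python
def split_proteins(amino_acids, min_length=20):
    """Split a translated sequence at stop codons (*) and keep chains
    with at least min_length amino acids."""
    chains = amino_acids.split("*")
    return [chain for chain in chains if len(chain) >= min_length]


# Toy sequence: two short chains, one 25-residue chain, and a trailing stop
toy = "IKGLYLPR*Q*" + "M" * 25 + "*"

all_chains = toy.split("*")
kept = split_proteins(toy)
print("Total chains:", len(all_chains))
print("Chains with >= 20 amino acids:", len(kept))
```

Note that splitting can produce empty chains, for example after a trailing stop codon or between two adjacent stop codons. The length filter discards them along with the short fragments.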

Protein Analysis with the ProtParam Module in Biopython

poi_list = []
MW_list = []

for record in Proteins[:]:
    print("\n")
    X = ProtParam.ProteinAnalysis(str(record))
    POI = X.count_amino_acids()
    poi_list.append(POI)
    MW = X.molecular_weight()
    MW_list.append(MW)
    print("Protein of Interest = ", POI)
    try:
        print("Amino acids percent = ", str(X.get_amino_acids_percent()))
    except ZeroDivisionError:
        # Empty chains (e.g. between adjacent stop codons) have no residues
        pass
    print("Molecular weight = ", MW)
    try:
        print("Aromaticity = ", X.aromaticity())
    except ZeroDivisionError:
        pass
    print("Flexibility = ", X.flexibility())
    try:
        print("Secondary structure fraction = ", X.secondary_structure_fraction())
    except ZeroDivisionError:
        pass
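Two of these quantities are simple enough to sketch by hand, which also shows why the loop guards against ZeroDivisionError: both divide by the chain length, and an empty chain has length zero. The functions below are hypothetical plain-Python stand-ins, not ProtParam's code. They assume aromaticity means the relative frequency of phenylalanine (F), tryptophan (W) and tyrosine (Y), and they report composition as fractions between 0 and 1.

```python
from collections import Counter

AROMATIC = "FWY"  # phenylalanine, tryptophan, tyrosine


def amino_acid_fraction(protein):
    """Return the fraction of each amino acid in the chain."""
    if not protein:
        return {}  # avoid dividing by zero for empty chains
    counts = Counter(protein)
    return {aa: n / len(protein) for aa, n in counts.items()}


def aromaticity(protein):
    """Return the relative frequency of F, W and Y in the chain."""
    if not protein:
        return 0.0
    return sum(protein.count(aa) for aa in AROMATIC) / len(protein)


toy_chain = "IKGLYLPR"  # first chain quoted in the text
print(amino_acid_fraction(toy_chain))
print("Aromaticity:", aromaticity(toy_chain))
```

Returning an empty result for empty chains is one design choice. The loop above instead lets the error occur and skips the printout.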

Because the code above produces output for all 775 proteins, only one of the output screens is shown here.
