Biopython Coronavirus Notebook - Ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Biopython coronavirus notebook tutorial"
]
},
{
"metadata": {},
"source": [
"This basic tutorial shows you how to identify an \"Unknown sequence\" of DNA/RNA, which happens to derive from a coronavirus genome (spoiler alert!). This tutorial uses [Biopython](https://github.com/biopython/biopython) (calling some tools) to identify and start to characterize this sequence."
]
},
{
"metadata": {},
"source": [
"## Setup"
]
},
{
"metadata": {},
"source": [
"Imports and print version information"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" import google.colab\n",
" # Running on Google Colab, so install Biopython first\n",
" !pip install biopython\n",
"except ImportError:\n",
" pass"
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Python version: sys.version_info(major=3, minor=8, micro=2,
releaselevel='final', serial=0)\n",
"Biopython version: 1.76\n"
]
}
],
"source": [
"import os\n",
"import sys\n",
"\n",
"from urllib.request import urlretrieve\n",
"\n",
"import Bio\n",
"from Bio import SeqIO, SearchIO, Entrez\n",
"from Bio.Seq import Seq\n",
"from Bio.SeqUtils import GC\n",
"from Bio.Blast import NCBIWWW\n",
"from Bio.Data import CodonTable\n",
"\n",
"print(\"Python version:\", sys.version_info)\n",
"print(\"Biopython version:\", Bio.__version__)"
]
},
{
"metadata": {},
"source": [
"Input file"
]
},
{
"metadata": {},
"outputs": [],
"source": [
"input_file = \"unknown-sequence.fa\"\n",
"\n",
"fasta_loc = (\"https://raw.githubusercontent.com/chris-rands/\"\n",
" \"biopython-coronavirus/master/unknown-sequence.fa\")\n",
"\n",
"if not os.path.exists(input_file):\n",
" urlretrieve(fasta_loc, input_file)"
]
},
{
"metadata": {},
"source": [
"## Basic genome properties"
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"Unknown_sequence\n"
]
}
],
"source": [
"for record in SeqIO.parse(input_file, \"fasta\"):\n",
" print(record.id)"
]
},
{
"metadata": {},
"source": [
"There is just a single sequence with header \"Unknown_sequence\". We are not
dealing with many chromosomes, scaffolds or contigs."
]
},
{
"metadata": {},
"source": [
"Extract the sequence"
]
},
{
"metadata": {},
"outputs": [],
"source": [
"record = SeqIO.read(input_file, \"fasta\")"
]
},
{
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA',
SingleLetterAlphabet())"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"record.seq"
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"Sequence length (bp) 29903\n"
]
}
],
"source": [
"print(\"Sequence length (bp)\", len(record))"
]
},
{
"metadata": {},
"source": [
"The sequence length is ~30 kb. If this sequence represents the genome of an individual organism, then it is very small: far too small for a typical eukaryote, and in fact too short for many microbes too (e.g. bacterial genomes are typically in the Mb range). This indicates that the genome could be from a virus."
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"GC content (%) 37.97277865097148\n"
]
}
],
"source": [
"print(\"GC content (%)\", GC(record.seq))"
]
},
{
"metadata": {},
"source": [
"The GC content is about 38%, so the sequence is somewhat AT-rich, but within a 'normal' range."
]
},
{
"metadata": {},
"source": [
"## Compare to other genome sequences"
]
},
{
"metadata": {},
"source": [
"Let's use BLAST to align the unknown sequence to other annotated sequences in the NCBI nt database, which contains sequences from many different species from across the tree of life.\n",
"\n",
"This may take ~10 minutes since we are doing an online search against many sequences (for larger queries, it would be sensible to run BLAST locally instead; see `Bio.Blast.Applications`)."
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"CPU times: user 155 ms, sys: 69 ms, total: 224 ms\n",
"Wall time: 2min 5s\n"
]
}
],
"source": [
"%%time\n",
"result_handle = NCBIWWW.qblast(\"blastn\", \"nt\", record.seq)"
]
},
{
"metadata": {},
"source": [
"Let's process the results with one of Biopython's generic parsers"
]
},
{
"metadata": {},
"outputs": [],
"source": [
"blast_qresult = SearchIO.read(result_handle, \"blast-xml\")"
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"Program: blastn (2.10.0+)\n",
" Query: No (29903)\n",
" definition line\n",
" Target: nt\n",
" Hits: ---- -----
----------------------------------------------------------\n",
" # # HSP ID + description\n",
" ---- -----
----------------------------------------------------------\n",
" 0 1 gi|1798174254|ref|NC_045512.2| Wuhan seafood market
pn...\n",
" 1 1 gi|1805293633|gb|MT019531.1| Severe acute respiratory
...\n",
...\n",
" 3 1 gi|1802633808|gb|MN996528.1| Severe acute respiratory
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
...\n",
" 26 1 gi|1807816337|dbj|LC522974.1| Severe acute
respiratory...\n",
" 27 1 gi|1807816315|dbj|LC522972.1| Severe acute
respiratory...\n",
" 28 1 gi|1804153870|emb|LR757996.1| Wuhan seafood market
pne...\n",
" 29 1 gi|1804153869|emb|LR757995.1| Wuhan seafood market
pne...\n",
" ~~~\n",
...\n",
...\n",
...\n"
]
}
],
"source": [
"print(blast_qresult)"
]
},
{
"metadata": {},
"source": [
"Those descriptions are truncated. Let's view them in full for just the first 5 hits"
]
},
{
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete
genome',\n",
" 'Severe acute respiratory syndrome coronavirus 2 isolate
BetaCoV/Wuhan/IPBCAMS-WH-03/2019, complete genome',\n",
BetaCoV/Wuhan/IPBCAMS-WH-01/2019, complete genome',\n",
" 'Severe acute respiratory syndrome coronavirus 2 isolate WIV04, complete
genome',\n",
SARS-CoV-2/Yunnan-01/human/2020/CHN, complete genome']"
]
},
"metadata": {},
}
],
"source": [
"[hit.description for hit in blast_qresult[:5]]"
]
},
{
"metadata": {},
"source": [
"Well, that looks fairly conclusive. Without doing any quantitative analyses, it's already very likely that we have a coronavirus genome here, specifically SARS-CoV-2, the cause of the COVID-19 pandemic!"
]
},
{
"metadata": {},
"source": [
"Let's have a look at the first result in a bit more detail to check some of
the alignment metrics"
]
},
{
"metadata": {},
"outputs": [],
"source": [
"first_hit = blast_qresult[0]"
]
},
{
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome'"
]
},
"metadata": {},
}
],
"source": [
"first_hit.description"
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"0.0 53927.4\n"
]
}
],
"source": [
"first_hsp = first_hit[0]\n",
"print(first_hsp.evalue, first_hsp.bitscore)"
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"DNAAlphabet() alignment with 2 rows and 29903 columns\n",
"ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTC...AAA No\n",
"ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTC...AAA gi|1798174254|ref|
NC_045512.2|\n"
]
}
],
"source": [
"print(first_hsp.aln)"
]
},
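{
"metadata": {},
"source": [
"The alignment appears to be of high quality and not merely a spurious hit: the E-value is 0.0, the bit score is high (53927.4), and the alignment spans all 29903 bases of the query"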
]
},
{
"metadata": {},
"source": [
"We could view/save the full sequence alignment. For illustration purposes, here are just the first 100 characters in FASTA format"
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
">No definition line\n",
"ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT\n",
"GTTCTCTAAACGAACTTTA\n"
]
}
],
"source": [
"print(first_hsp.aln.format(\"fasta\")[:100])"
]
},
{
"metadata": {},
"source": [
"## Extract annotations on the matching genome sequence"
]
},
{
"metadata": {},
"source": [
"Let's extract some more structured metadata on the top-matching homologous sequence, using NCBI Entrez via Biopython to fetch a GenBank file"
]
},
{
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'NC_045512.2'"
]
},
"metadata": {},
}
],
"source": [
"NCBI_id = first_hit.id.split('|')[3]\n",
"NCBI_id"
]
},
{
"metadata": {},
"outputs": [],
"source": [
"Entrez.email = \"A.N.Other@example.com\" # Always tell NCBI who you are"
]
},
{
"metadata": {},
"outputs": [],
"source": [
"handle = Entrez.efetch(db=\"nucleotide\", id=NCBI_id, retmode=\"text\", rettype=\"gb\")"
]
},
{
"metadata": {},
"outputs": [],
"source": [
"genbank_record = SeqIO.read(handle, \"genbank\")"
]
},
{
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SeqRecord(seq=Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA',
IUPACAmbiguousDNA()), id='NC_045512.2', name='NC_045512', description='Severe acute
respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome',
dbxrefs=['BioProject:PRJNA485481'])"
]
},
"metadata": {},
}
],
"source": [
"genbank_record"
]
},
{
"metadata": {},
"source": [
"There's a lot of information in the GenBank record if you know where to find it..."
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"Is it single or double stranded and a DNA or RNA virus?\n",
" ss-RNA\n"
]
}
],
"source": [
"print(\"Is it single or double stranded and a DNA or RNA virus?\\n\",\n",
" genbank_record.annotations[\"molecule_type\"])"
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"What is the full NCBI taxonomy of this virus?\n",
" ['Viruses', 'Riboviria', 'Nidovirales', 'Cornidovirineae', 'Coronaviridae',
'Orthocoronavirinae', 'Betacoronavirus', 'Sarbecovirus']\n"
]
}
],
"source": [
"print(\"What is the full NCBI taxonomy of this virus?\\n\",\n",
" genbank_record.annotations[\"taxonomy\"])"
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"What are the relevant references/labs who generated the data?\n",
"\n",
"location: [0:29903]\n",
"authors: Wu,F., Zhao,S., Yu,B., Chen,Y.-M., Wang,W., Hu,Y., Song,Z.-G.,
Tao,Z.-W., Tian,J.-H., Pei,Y.-Y., Yuan,M.L., Zhang,Y.-L., Dai,F.-H., Liu,Y.,
Wang,Q.-M., Zheng,J.-J., Xu,L., Holmes,E.C. and Zhang,Y.-Z.\n",
"title: A novel coronavirus associated with a respiratory disease in Wuhan of
Hubei province, China\n",
"journal: Unpublished\n",
"medline id: \n",
"pubmed id: \n",
"comment: \n",
"\n",
"location: [0:29903]\n",
"authors: \n",
"consrtm: NCBI Genome Project\n",
"title: Direct Submission\n",
"journal: Submitted (17-JAN-2020) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA\n",
"medline id: \n",
"pubmed id: \n",
"comment: \n",
"\n",
"location: [0:29903]\n",
"authors: Wu,F., Zhao,S., Yu,B., Chen,Y.-M., Wang,W., Hu,Y., Song,Z.-G.,
Tao,Z.-W., Tian,J.-H., Pei,Y.-Y., Yuan,M.L., Zhang,Y.-L., Dai,F.-H., Liu,Y.,
Wang,Q.-M., Zheng,J.-J., Xu,L., Holmes,E.C. and Zhang,Y.-Z.\n",
"title: Direct Submission\n",
"journal: Submitted (05-JAN-2020) Shanghai Public Health Clinical Center &
School of Public Health, Fudan University, Shanghai, China\n",
"medline id: \n",
"pubmed id: \n",
"comment: \n",
"\n"
]
}
],
"source": [
"print(\"What are the relevant references/labs who generated the data?\\n\")\n",
"for reference in genbank_record.annotations[\"references\"]:\n",
" print(reference)"
]
},
{
"metadata": {},
"source": [
"Now we can read up more about the virus and the source data by following these references and doing appropriate web searches."
]
},
{
"metadata": {},
"source": [
"Note that from this id, we could also find the [record
here](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/) on the NCBI website."
]
},
{
"metadata": {},
"source": [
"## Protein level analyses"
]
},
{
"metadata": {},
"source": [
"We might be interested in the gene/protein sequences, not just the entire genome. It is possible to retrieve the protein coding sequences (CDSs) from the GenBank record"
]
},
{
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"47"
]
},
"metadata": {},
}
],
"source": [
"len(genbank_record.features)"
]
},
{
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{\"3'UTR\", \"5'UTR\", 'CDS', 'gene', 'mat_peptide', 'source',
'stem_loop'}"
]
},
"metadata": {},
}
],
"source": [
"{feature.type for feature in genbank_record.features}"
]
},
{
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12"
]
},
"metadata": {},
}
],
"source": [
"CDSs = [feature for feature in genbank_record.features if feature.type
== \"CDS\"]\n",
"len(CDSs)"
]
},
{
"metadata": {},
"source": [
"Let's look at the first protein and extract the underlying sequence"
]
},
{
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['orf1ab']"
]
},
"metadata": {},
}
],
"source": [
"CDSs[0].qualifiers[\"gene\"]"
]
},
{
"metadata": {},
"outputs": [],
"source": [
"protein_seq = Seq(CDSs[0].qualifiers[\"translation\"][0])"
]
},
{
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Seq('MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLV...VNN')"
]
},
"metadata": {},
}
],
"source": [
"protein_seq"
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"Does the sequence begin with a start codon?\n",
" True\n"
]
}
],
"source": [
"print(\"Does the sequence begin with a start codon?\\n\",\n",
" protein_seq.startswith(\"M\"))"
]
},
{
"metadata": {},
"source": [
"We can check roughly how this protein sequence relates to the underlying
nucleotide sequence by looking at the DNA codon table."
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"Table 1 Standard, SGC0\n",
"\n",
" | T | C | A | G |\n",
"--+---------+---------+---------+---------+--\n",
"T | TTT F | TCT S | TAT Y | TGT C | T\n",
"T | TTC F | TCC S | TAC Y | TGC C | C\n",
"T | TTA L | TCA S | TAA Stop| TGA Stop| A\n",
"T | TTG L(s)| TCG S | TAG Stop| TGG W | G\n",
"--+---------+---------+---------+---------+--\n",
"C | CTT L | CCT P | CAT H | CGT R | T\n",
"C | CTC L | CCC P | CAC H | CGC R | C\n",
"C | CTA L | CCA P | CAA Q | CGA R | A\n",
"C | CTG L(s)| CCG P | CAG Q | CGG R | G\n",
"--+---------+---------+---------+---------+--\n",
"A | ATT I | ACT T | AAT N | AGT S | T\n",
"A | ATC I | ACC T | AAC N | AGC S | C\n",
"A | ATA I | ACA T | AAA K | AGA R | A\n",
"A | ATG M(s)| ACG T | AAG K | AGG R | G\n",
"--+---------+---------+---------+---------+--\n",
"G | GTT V | GCT A | GAT D | GGT G | T\n",
"G | GTC V | GCC A | GAC D | GGC G | C\n",
"G | GTA V | GCA A | GAA E | GGA G | A\n",
"G | GTG V | GCG A | GAG E | GGG G | G\n",
"--+---------+---------+---------+---------+--\n"
]
}
],
"source": [
"print(CodonTable.unambiguous_dna_by_id[1])"
]
},
{
"metadata": {},
"source": [
"However, we can't perform an exact \"reverse translation\" of course, since the genetic code is degenerate: several different codons can encode the same amino acid. Note that if instead we started with the nucleotide sequence, then we could use the `.transcribe()` and `.translate()` methods of Biopython's `Seq` object to convert sequences from DNA to RNA and from DNA to protein respectively."
]
},
{
"metadata": {},
"outputs": [
{
"name": "stdout",
"text": [
"Protein sequence length in amino acids 7096\n"
]
}
],
"source": [
"print(\"Protein sequence length in amino acids\", len(protein_seq))"
]
},
{
"metadata": {},
"source": [
"It is a long protein for a virus. Let's check the annotation"
]
},
{
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['orf1ab polyprotein']"
]
},
"metadata": {},
}
],
"source": [
"CDSs[0].qualifiers[\"product\"]"
]
},
{
"metadata": {},
"source": [
"So it looks like this is a polyprotein, which explains why it is a relatively long protein. Polyproteins are a typical feature of some viral genomes: several smaller proteins are translated as one long chain, which is then cleaved into the individual mature proteins (note the `mat_peptide` features in this record). This provides a particular organization of the viral proteome."
]
},
{
"metadata": {},
"source": [
"## What's next?"
]
},
{
"metadata": {},
"source": [
"Logical next steps at the genome level might include building a multiple sequence alignment from many coronavirus genomes (check out the Biopython wrappers/parsers for `Clustal` and `Mafft` and `Bio.Align`/`Bio.pairwise2` plus `Bio.AlignIO`), building a phylogeny with an external tool like [iq-tree](http://www.iqtree.org/) and then viewing the tree with `Bio.Phylo`, the [ete3 toolkit](http://etetoolkit.org/), or [Jalview](https://www.jalview.org/).\n",
"\n",
"Other protein level analyses could include building protein trees, annotating and visualising the proteins (e.g. `Bio.Graphics`), doing evolutionary rate analyses (e.g. `Bio.Phylo.PAML`), looking at protein structure, clustering and much more.\n",
"\n",
"These kinds of analyses can provide useful biological and epidemiological information, which is very important for this coronavirus in an outbreak situation. For example, they allow tracking of how the outbreak spreads and can indicate appropriate infection control measures, although caution in the interpretation of results is always required. See [Nextstrain](https://nextstrain.org/ncov) for an excellent resource of this kind."
]
},
{
"metadata": {},
"source": [
"Note that there is a lot of other functionality in Biopython; this is just a very small fraction of the modules. See [Peter Cock's Biopython workshop](https://github.com/peterjc/biopython_workshop) and the extensive [official tutorial documentation](http://biopython.org/DIST/docs/tutorial/Tutorial.html)."
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
