
Structure and Function of SARS-CoV-2 Spike Protein:

A Multiple Sequence Alignment (MSA) Study


Written by Fiona Wood, Abigail Diering, and Dr. Mary Peek
School of Chemistry and Biochemistry | Georgia Institute of Technology | Fall 2020

INTRODUCTION

SARS-CoV-2: The ongoing COVID-19 pandemic has been one of the most serious
pandemics in modern history, infecting 22.2 million people worldwide and causing 783,000
deaths as of August 19, 2020. COVID-19 is caused by SARS-CoV-2, a virus belonging to the family
Coronaviridae (coronaviruses). These viruses have a single-stranded RNA genome and are
characterized by the “corona” of protein spikes surrounding the viral envelope.

Figure 1: Electron microscope image of avian coronavirus particles, showing the characteristic
spike proteins as a series of club-like projections surrounding the main viral capsule. (Source:
CDC Public Health Image Library)

NCBI: The National Center for Biotechnology Information (NCBI) is a collection of public
databases maintained by the National Institutes of Health (NIH). NCBI is primarily used to
store and distribute sequence information for genes, genomes and proteins. Scientists from
around the world can deposit and retrieve this information and can use it to perform their own
experiments and data analyses.

BLAST: BLAST stands for Basic Local Alignment Search Tool. It is an algorithm for quickly and
efficiently searching the millions of sequence entries in the NCBI databases and retrieving those
which have the highest similarity to an input sequence.
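
The “local alignment” in BLAST's name means finding the best-matching stretch shared by two sequences, rather than lining them up end to end. BLAST itself uses a faster heuristic to do this at database scale. The sketch below uses the classic Smith-Waterman dynamic programming method instead, purely to illustrate the idea. The function name and the scoring values (match = 2, mismatch = -1, gap = -2) are assumptions chosen for this example, not BLAST's defaults.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b (Smith-Waterman)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Toy sequences that share the stretch "ACGTAC" but differ at both ends.
print(local_alignment_score("TTTACGTACGGG", "CCACGTACAA"))
```

Because only the shared stretch contributes to the score, the unrelated ends of the two toy sequences do not penalize the result.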

Teaching and Learning Goals

Upon completing this laboratory session, students will be able to:

• Obtain genome, gene, and protein data from the NCBI public database
• Use BLAST to obtain sequences of genes/proteins similar to a reference
• Align multiple gene/protein sequences to determine conserved features within a gene
family

MATERIALS

• Computer (Windows or Mac)
• Web Browser. Preferred: Chrome or Firefox, or Safari for Mac
• Text Editing Software:
o For Windows: Notepad (installed by default)
o For Mac: TextEdit
• Any MSA Viewer. Download this program before class. Recommended:
o AliView http://ormbunkar.se/aliview/
o SeaView http://doua.prabi.fr/software/seaview

EXPERIMENTAL PROCEDURES
A. Reference File Preparation

1. Use a web browser to navigate to the NCBI databases: http://www.ncbi.nlm.nih.gov

2. Use the search box at the top of the screen to search for “SARS-CoV-2”. This will search
all the NCBI databases for information about the novel coronavirus. The databases on
the search page are organized into six categories:
a. Literature: Scientific publications
b. Genes: Summaries of published information about specific genes
c. Proteins: Protein sequence data and related info
d. Genomes: Genome sequence data and related info
e. Genetics: Documentation of known variants of genes
f. PubChem: Biochemical info

3. In the “Genes” category, click on “Gene” to view the results from the Gene database.
This contains organized entries for individual genes from various organisms.

4. Select “surface glycoprotein” in the search results. This is the gene encoding the
coronavirus spike or S protein. This will bring you to a page summarizing the
information about the gene in all the NCBI databases:

5. Scroll down until you find the section titled “NCBI Reference Sequences (RefSeq)”.
These are the reference nucleotide and protein sequences for the gene. The “Genomic”
subsection contains the reference nucleotide sequence from the genome, while the
“mRNA and Protein(s)” section contains the reference protein sequence.

6. To obtain the nucleotide sequence, click on “FASTA” next to the “Download” heading.
This will take you to a page showing the nucleotide sequence of the gene in FASTA
format.
a. To download the sequence, in the upper right-hand corner of the page click on
“send to”, select “file”, make sure the selected format is “FASTA”, then click
“create file”. This will download a file called “sequence.fasta”.

b. Rename this file to something more informative before moving on.

7. To obtain the protein sequence, go back to the gene page and click on the first link under
the “mRNA and Protein(s)” heading. This is the accession number for the protein
sequence and will take you to the protein database entry for the gene.
a. Repeat step 6a on this page to download the sequence. Make sure the format is
“FASTA” before you download.
b. Rename the resulting file to something more informative before you continue.
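
A FASTA file is plain text: each record starts with a header line beginning with “>”, followed by one or more lines of sequence. As a quick check that a download worked, the records can be read with a few lines of Python. This is a minimal sketch; the function name `read_fasta` and the toy record are made up for illustration, and in practice the text would come from your own renamed file.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy record, not real data. For a real file, read its contents first, e.g.
# text = open("my_renamed_file.fasta").read()  (hypothetical file name)
toy_text = ">toy_record example header\nATGTTTGTT\nTTTCTTGTT\n"
for header, sequence in read_fasta(toy_text):
    print(header, len(sequence))
```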

B. BLAST

8. Go to the BLAST website: https://blast.ncbi.nlm.nih.gov/Blast.cgi

Four different basic local alignment search tools are available:
• Nucleotide BLAST – searches a nucleotide database using a nucleotide query
• Protein BLAST – searches a protein sequence database using a protein query
• blastx – translates a nucleotide query into protein and searches against the protein
database
• tblastn – searches a protein query against the nucleotide database, which is translated
into protein in all six reading frames
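
The translated searches rely on reading a nucleotide sequence in all six reading frames (three offsets on each strand). The sketch below illustrates that translation step only, assuming the standard genetic code. The function names are made up for this example, and the code is not how BLAST is implemented.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper().replace("U", "T")
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```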

9. Click on the “Nucleotide BLAST” option to see a screen like this:

For more information about each of the options in the figure above, refer to Box 1 below.

Box 1: BLAST Options


The options for performing the BLAST are organized into three main regions:

Enter Query Sequence


Here, you may enter a sequence of interest, also known as the “query”. There are
multiple ways to enter the query:
a. Copy the sequence into the box using one-letter nucleotide or amino
acid codes
b. Enter the sequence ID of the gene or protein you want to search into
the box, if it already exists within the database.
c. Upload a file containing the sequence data in FASTA format

You can also specify a subrange of the entered sequence to search using the
“Query Subrange” options and give the search a descriptive title using the
“Job Title” option.


Choose Search Set


Allows you to narrow down which results you want from the search. The options in
this box are as follows:
a. “Database” – this allows you to choose a sequence database to search.
By default, BLAST will search “nr”, which includes all non-redundant
nucleotide/protein sequences. Other options include …

b. “Organism” – this allows you to specify which species/strains are
allowed or disallowed in the search.

c. “Exclude” – these options allow you to remove types of samples, such
as environmental samples, from the possible results.

d. “Limit to” (nucleotide only) – By selecting this option, only sequences
from “type specimens” will be included in the search. This will exclude
any variants, environmental samples, clones, etc. from the results.

e. “Entrez Query” (nucleotide only) – This allows you to further refine
your results based on categorizations in the database – e.g. molecule
type, sequence length, etc.

Program Selection
Allows you to choose the algorithm used to perform the search. There are a number of
variations on the BLAST algorithm which are optimized for different purposes; for
example, PSI-BLAST, a type of protein BLAST, searches the database iteratively to
collect more distantly related sequences that match the input pattern of protein
features. In most cases, the default options (MEGABLAST for nucleotide, blastp for
protein) are best for quickly searching the databases for similar sequences.

10. Select “Choose File” in the “Enter Query Sequence” box.


11. Upload your nucleotide sequence file from the previous section.
12. Click the “BLAST” button at the bottom of the screen and wait for the job to finish. (It
should take no more than a few minutes.) When it does, you will see a screen that looks
like this:

13. The first part of the screen simply gives details of the search as well as options for
filtering the results. The actual results are shown in the second part of the screen:

14. The important columns to note from the figure in Step 13 are:
• Description tells you the name of the gene/protein as well as the organism it comes
from.
• Query Cover tells you what percentage of the query's length is included in the alignment
with the result, also called the subject.
• E value is the number of matches with an equal or better score that would be expected
by chance in a database of this size – smaller numbers mean the match is more significant.
• Per. Ident. (percent identity) tells you what percentage of the aligned bases/amino acids
are identical between the query and the subject.

You may note that all or almost all of the results of your BLAST come from SARS-CoV-2
genomes and are 100% identical to your query. This is not very useful for gaining
insight into the structure or evolution of the spike protein, so you will need to filter out
SARS-CoV-2 from your results.
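
To make the Query Cover and Per. Ident. columns from Step 14 concrete, the sketch below computes both values for a single toy pairwise alignment. It assumes percent identity is taken over all alignment columns (gaps included) and query cover over the full query length. The function name and the sequences are invented for illustration, and the BLAST website may round or combine values differently.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Return (percent_identity, query_cover) for one gapped pairwise alignment.

    aligned_query / aligned_subject: equal-length strings with '-' for gaps.
    query_length: length of the full, unaligned query sequence.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    query_residues = sum(1 for q in aligned_query if q != "-")
    percent_identity = 100 * identical / columns
    query_cover = 100 * query_residues / query_length
    return percent_identity, query_cover


# Toy alignment: 7 of the query's 10 bases fall inside the aligned region.
identity, cover = alignment_stats("ACGT-ACG", "ACGTTACC", query_length=10)
print(f"Per. Ident.: {identity:.1f}%  Query Cover: {cover:.1f}%")
```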

15. Go back to the BLAST page and enter “SARS-CoV-2” into the “organism” box in the
“Choose Search Set” section, and make sure the “exclude” box next to it is checked.
Then click “BLAST” again to redo the BLAST search.

16. Scroll through your results and un-check any that are labeled “synthetic construct”,
“recombinant”, “clone” or similar. These are artificial nucleotide sequences which will
not be useful for our data analyses.
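
If some artificial sequences slip through, the same filtering can be applied to the downloaded records by checking their description lines. This is a minimal sketch assuming the records have already been read into (header, sequence) pairs. The keyword list mirrors the labels named in Step 16, and the function name and toy headers are made up for this example. A simple keyword match like this could also discard a natural sequence whose description happens to contain one of these words, so review what is removed.

```python
ARTIFICIAL_KEYWORDS = ("synthetic construct", "recombinant", "clone")


def drop_artificial(records, keywords=ARTIFICIAL_KEYWORDS):
    """Keep only records whose header contains none of the keywords."""
    return [
        (header, sequence)
        for header, sequence in records
        if not any(keyword in header.lower() for keyword in keywords)
    ]


# Toy records with invented headers, not real database entries.
toy_records = [
    ("toy_1 Bat coronavirus isolate, complete genome", "ATGTTT"),
    ("toy_2 Synthetic construct spike gene", "ATGTTC"),
    ("toy_3 Recombinant coronavirus clone", "ATGCTT"),
]
for header, _ in drop_artificial(toy_records):
    print(header)
```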

17. In the top bar, select “Download”, then “FASTA (aligned sequences)”. This will
download all the selected sequences into a single FASTA file called “seqdump.txt”.
Again, rename this to something more informative before continuing.

18. Go back to the main BLAST page, but now select “Protein BLAST”. The options for the
protein BLAST are very similar to those for the nucleotide BLAST (see Box 1). Note: the
search can take anywhere from 3-7 minutes but will show a loading screen that refreshes every
few seconds if it is working.

19. Repeat the appropriate steps above with your protein sequence file.

When you download the protein file, select “FASTA (complete sequences)” instead of
“FASTA (aligned sequences)” to download the entire protein sequence for each record
instead of just the aligned portion. (We did not do this for the nucleotide sequences
because most of them come from genome data, so downloading the complete sequences would
have downloaded the entire genome of each sample rather than just our desired gene!)

C. Multiple Sequence Alignment

We will be using Multiple Alignment by Fast Fourier Transform (MAFFT), an online alignment
program, to perform multiple sequence alignments.

20. Before you start:

a. Open your sequence files from the previous steps using a text editing program
(Notepad, TextEdit, etc. NOT Word). The FASTA file format can be read in a text
editing program but you may have to manually select 'open with' to do so.

b. Copy your SARS-CoV-2 nucleotide sequence and paste it at the beginning of the
collection of nucleotide sequences you obtained in the last section. Save the file.

c. Repeat Step 20b with your SARS-CoV-2 protein sequence and your collection of
protein sequences from the previous section.
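
The copy-and-paste in Step 20 can also be done with a short script, which avoids accidentally breaking a header line. This is a minimal sketch that works on FASTA text. The function name and toy records are invented for illustration, and with real files you would first read each file's contents and then write the combined text to a new file.

```python
def prepend_reference(reference_fasta, collection_fasta):
    """Return FASTA text with the reference record placed before the collection."""
    parts = [reference_fasta.strip("\n"), collection_fasta.strip("\n")]
    # Join with a single newline so no header ends up on a sequence line.
    return "\n".join(part for part in parts if part) + "\n"


# Toy FASTA text, not real data.
reference = ">reference_spike toy sequence\nATGTTTGTT\n"
collection = ">hit_1 toy sequence\nATGTTCGTT\n>hit_2 toy sequence\nATGCTTGTT\n"
combined = prepend_reference(reference, collection)
print(combined)
```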

21. Go to the MAFFT website at https://mafft.cbrc.jp/alignment/server/

22. Select “Choose File” and upload the nucleotide sequences that you just edited.

23. Make sure all the options in the second box have the “same as input” option selected.
This minimizes formatting differences between the input and output files.

24. Click “Submit” and wait for the alignment to finish. This should take no more than a
few minutes. Once the alignment is processed, you will be redirected to a page that
looks like this:

25. The default output format for MAFFT is a format called “Clustal”. To download a
FASTA formatted file containing the alignment, select “Fasta format” at the top of the
page. (Note that as usual, the file will have a strange name by default and should be
renamed before you continue.)

26. Open your preferred alignment viewer (e.g. AliView or SeaView) and drag the file you
just downloaded into the window. You should see something like this:

Note that each nucleotide is displayed in a different color, so you can easily see which
sequences are aligned and which are not.

27. Repeat Steps 22-26 using your protein sequences. When you open the protein alignment
in SeaView, it looks something like this:

Note that the amino acids in this alignment are colored based on their properties. Make
sure to pay attention to the actual text of the alignment to spot mutations.

DATA ANALYSIS
Nucleotide Sequences:

• From your BLAST results (on the NCBI website), which coronavirus sequence is most
similar to SARS-CoV-2?

• Find this sequence in your alignment. How many mutations are present between it and
SARS-CoV-2?
o Hint: In SeaView you can change the order of sequences by highlighting the
sequence name, then ctrl-clicking another sequence name to move the
highlighted sequence next to it.

o In AliView you can drag and drop the sequences.
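
Counting differences by eye is error-prone for a gene of this length, so it can help to cross-check your count with a short script. The sketch below compares two rows taken from the same alignment and reports substitutions and gap positions separately. The function name and toy rows are invented for illustration. Note the assumption that every gapped column counts as one difference, so a three-base deletion is counted as three positions rather than as one mutation event.

```python
def count_differences(aligned_a, aligned_b):
    """Return (substitutions, gap_positions) between two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("rows from the same alignment must have equal length")
    substitutions = gap_positions = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue  # column is a gap in both rows; nothing to compare
        if x == "-" or y == "-":
            gap_positions += 1
        elif x != y:
            substitutions += 1
    return substitutions, gap_positions


# Toy rows, not real spike sequences.
subs, gaps = count_differences("ACGT-ACGA", "ACCTTAC-A")
print(f"{subs} substitution(s), {gaps} gap position(s)")
```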

• You should have at least one SARS coronavirus sequence in your alignment. Compare
this sequence to SARS-CoV-2, as you did for the previous sequence.

• Using the SeaView or AliView results:


o Which regions of the alignment are highly conserved across all the sequences in
your alignment?
o Which regions are not well conserved?
o What might this mean for the function of the protein product?
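
One way to back up a visual impression of conservation is to score each alignment column by how often its most common residue appears. The sketch below does this for a toy alignment and then reports stretches of fully conserved columns. The function names, the threshold, and the minimum region length are all assumptions chosen for illustration, and real analyses often use more refined conservation measures.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Return, per column, the fraction of sequences sharing the top residue."""
    scores = []
    for column in zip(*aligned_seqs):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # column contains only gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_regions(scores, threshold=1.0, min_length=3):
    """Return 1-based (start, end) ranges where scores stay >= threshold."""
    regions = []
    start = None
    for i, score in enumerate(scores + [-1.0]):  # sentinel closes the last run
        if score >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                regions.append((start + 1, i))
            start = None
    return regions


# Toy alignment, not real data.
toy_alignment = ["ACGTACGT", "ACGTTCGT", "ACGAACGT"]
print(conserved_regions(column_conservation(toy_alignment)))
```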

Protein Sequences:

• Repeat the analysis of the nucleotide sequences with your protein data.
o Are your results the same?
o Why or why not?

• Perform a literature search to find the domains of the Coronavirus Spike protein.
Identify the domains in your alignment.
o Which domain is the most/least highly conserved? Why do you think this is?
o Based on your reading, which mutation(s) in SARS-CoV-2 do you think has the
greatest impact on the function of the protein? Explain your reasoning.
