
Principles of Protein Structure: Primary, Secondary and Tertiary Structure

To understand the basic principles of protein structure and its potential uses in academic and industrial research, for example in the pharmaceutical industry and in biotechnology, we first need to learn how the four levels of protein structure are related to each other. The first level is the amino acid sequence, often called the primary structure, perhaps to stress the fact that the sequence to a high degree defines the tertiary protein structure. The next level is the secondary structure, the primary building blocks of the tertiary protein structure. Secondary structure elements are combined into different structural motifs, or protein motifs, sometimes also called super-secondary structures. These in turn build up the separate folding units, the domains. Some tertiary protein structures consist of a single domain, others contain several domains with similar or different folds. Protein domains are the basic classification unit of protein folds. Although protein structures are usually deposited with the Protein Data Bank (PDB), two major databases are dedicated to protein structure fold classification, SCOP and CATH. Finally, we have the level of quaternary protein structure, which arises when several polypeptide chains, also called subunits, come together into a large macromolecular complex.

An example of a quaternary protein structure. The figure shows a cryo-electron microscopy (cryo-EM) reconstruction of the 24-subunit oligomer of yeast frataxin. Frataxin is a mitochondrial protein involved in iron storage, detoxification and iron delivery to various processes such as iron-sulfur cluster synthesis and heme synthesis in mitochondria. For references see Karlberg et al., 2006 and Schagerlof et al., 2008.

As outlined in the summary page of this chapter, I will discuss the following aspects of protein structures and provide some tutorials on the use of the various databases:
The 20 amino acid residues
Secondary structure: Helices and sheets, Torsion angles and the Ramachandran plot
Protein motifs
Protein folds and fold classification: Class and fold, Topology (superfamily), protein domains
Internet resources and databases related to protein structure
Protein DataBank files

The protein structures currently available in the Protein Data Bank have been classified into more than 1000 unique folds, and discussing all of them here is obviously impossible. However, all we need to do is to understand the basic principles; the rest will then be easier. We start with the relationship between the tertiary protein structure and the amino acid sequence. This relationship is rather complicated, since huge variation in the amino acid sequence can be tolerated within a particular type of tertiary structure. I say "type of structure", but the more correct term is "type of fold". In many cases the relation of a protein to a certain protein family is revealed only after its structure has been determined. For example, the structure of the anaerobic cobaltochelatase, determined some years ago (Schubert et al., 1999), unexpectedly showed a high degree of similarity to the structure of ferrochelatase (Al-Karadaghi et al., 1997), although there is only 11% sequence identity between the two proteins. This number is much smaller than the accepted homology threshold in sequence alignment (around 20-25%).
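To make the notion of "percent sequence identity" concrete, here is a minimal Python sketch that counts identical residues in a pairwise alignment. It assumes the two sequences are already aligned, that gaps are written as "-", and that identity is taken over the columns where neither sequence has a gap; other conventions for the denominator exist and give somewhat different numbers. The function names, the 20% default threshold (the lower end of the range quoted above) and the toy sequences are illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip insertions and deletions
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def below_homology_threshold(identity: float, threshold: float = 20.0) -> bool:
    """True when the identity falls below the assumed threshold (in percent)."""
    return identity < threshold


# Toy alignment: invented sequences, for illustration only.
seq_a = "MKT-AYIAKQR"
seq_b = "MRTLAYLGK-R"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identity, below threshold: {below_homology_threshold(identity)}")
```

With the 11% identity quoted for the two chelatases, `below_homology_threshold(11.0)` returns `True`, which is exactly the situation where structural comparison reveals a relationship that sequence comparison alone would miss.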

The interdomain contact area in the structure of elongation factor G (EF-G). Al-Karadaghi et al., 1996, Structure 4, 555-565.

The 20 Amino Acids: Hydrophobic, Hydrophilic, Polar and Charged Amino Acids and their Function in Protein Structures
Amino acids are joined into a polypeptide chain on the ribosome in a process called protein synthesis, or translation. During this process the peptide bond, the covalent bond between two amino acid residues, is formed. There are 20 amino acids most commonly occurring in nature. Each of them has its special properties, which are defined by the type of side chain it possesses. It is the side chain which makes each of the 20 amino acids unique and gives it a role to play in protein structure. Depending on the particular properties of the amino acid residue, for example its propensity to be in contact with a polar solvent like water, it is classified into one of the following classes: hydrophobic, polar or charged. The charged amino acid residues include lysine (+), arginine (+), aspartate (-) and glutamate (-). Polar amino acids include serine, threonine, asparagine, glutamine, histidine and tyrosine. The hydrophobic amino acids include alanine, valine, leucine, isoleucine, proline, phenylalanine, tryptophan, cysteine and methionine. The amino acid glycine has only a hydrogen atom in place of a side chain and is hard to assign to a certain class. However, glycine is often found on the surface of the protein tertiary structure in loop regions and provides additional flexibility to these regions. In contrast, proline provides rigidity to the protein structure by imposing certain torsion angles on the segment of the polypeptide chain where it is located. Together with proline, glycine is often highly conserved within a certain protein family. The reason is that the special properties of these two amino acids distinguish them from other amino acids and are important for preserving the type of protein tertiary structure within a protein family. We will come back to this question later, when we discuss torsion angles and the Ramachandran plot. Below, the 20 most common amino acids in proteins are listed with their three-letter and one-letter codes:

The Twenty Amino Acids Most Common in Proteins:


Charged:
Arginine - Arg - R
Lysine - Lys - K
Aspartic acid - Asp - D
Glutamic acid - Glu - E

Polar (may participate in hydrogen bonds):
Glutamine - Gln - Q
Asparagine - Asn - N
Histidine - His - H
Serine - Ser - S
Threonine - Thr - T
Tyrosine - Tyr - Y
Cysteine - Cys - C
Methionine - Met - M
Tryptophan - Trp - W

Hydrophobic (normally buried inside the protein core):
Alanine - Ala - A
Isoleucine - Ile - I
Leucine - Leu - L
Phenylalanine - Phe - F
Valine - Val - V
Proline - Pro - P
Glycine - Gly - G

The figure below shows the distribution of the 20 amino acid residues within the protein tertiary structure:

The graph above demonstrates the location of the 20 amino acids in different regions of a protein tertiary structure. The vertical axis shows the fraction of amino acid residues that are highly buried within the protein core (inaccessible to water), while the horizontal axis shows the amino acid names in one-letter code. Clearly, only a very small fraction of the charged residues is buried, while for the non-polar amino acids the fraction is very high. Taken from the tutorial by J.E. Wampler.

The propensity of amino acid residues to be (or not to be) in contact with polar solvent largely controls the distribution of each of the 20 amino acids within the volume of a protein structure. Thus, most protein molecules have a hydrophobic core, which is not accessible to solvent and is built up by hydrophobic amino acids. On the other hand, polar and charged amino acids preferentially cover the surface of the molecule and are in contact with the solvent. Very often they also interact with each other: positively and negatively charged amino acids form so-called salt bridges with each other, while polar amino acid side chains are involved in hydrogen bonding with side chain or main chain atoms and with water. Since these interactions are crucial for the stabilization of the protein tertiary structure, they are normally conserved within a protein family. A detailed atlas of hydrogen bonding for all 20 amino acids in protein structures was compiled by Ian McDonald and Janet Thornton and can be found here. In the next chapter we will go through the next level of the protein structure hierarchy: the secondary structure.
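For readers who prefer to work with sequences programmatically, the amino acid table from this section can be written as a small lookup. The sketch below follows the grouping of the table as printed (note that the running text places cysteine, methionine and tryptophan among the hydrophobic residues, so classifications of this kind are a matter of convention). The dictionary and function names are my own and purely illustrative.

```python
from collections import Counter

# One-letter code -> (full name, three-letter code, class).
# Classes follow the table in this section; other groupings are in use.
AMINO_ACIDS = {
    "R": ("Arginine", "Arg", "charged"),
    "K": ("Lysine", "Lys", "charged"),
    "D": ("Aspartic acid", "Asp", "charged"),
    "E": ("Glutamic acid", "Glu", "charged"),
    "Q": ("Glutamine", "Gln", "polar"),
    "N": ("Asparagine", "Asn", "polar"),
    "H": ("Histidine", "His", "polar"),
    "S": ("Serine", "Ser", "polar"),
    "T": ("Threonine", "Thr", "polar"),
    "Y": ("Tyrosine", "Tyr", "polar"),
    "C": ("Cysteine", "Cys", "polar"),
    "M": ("Methionine", "Met", "polar"),
    "W": ("Tryptophan", "Trp", "polar"),
    "A": ("Alanine", "Ala", "hydrophobic"),
    "I": ("Isoleucine", "Ile", "hydrophobic"),
    "L": ("Leucine", "Leu", "hydrophobic"),
    "F": ("Phenylalanine", "Phe", "hydrophobic"),
    "V": ("Valine", "Val", "hydrophobic"),
    "P": ("Proline", "Pro", "hydrophobic"),
    "G": ("Glycine", "Gly", "hydrophobic"),
}


def three_letter(sequence: str) -> str:
    """Convert a one-letter sequence into hyphen-separated three-letter codes."""
    return "-".join(AMINO_ACIDS[aa][1] for aa in sequence.upper())


def class_composition(sequence: str) -> dict:
    """Count how many residues of a sequence fall into each class."""
    return dict(Counter(AMINO_ACIDS[aa][2] for aa in sequence.upper()))


print(three_letter("GAK"))
print(class_composition("MKTAYIAKQR"))
```

A composition count like this says nothing about where a residue sits in the folded structure; it only summarizes the sequence in terms of the three classes used above.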

Protein Secondary Structure: Alpha Helices and beta Sheets and Hydrogen Bonds
Descriptions of the basic principles of protein structure, including the secondary structure, may be found in many good-quality biochemistry books. Here I will focus on the general aspects which are important to keep in mind, for example, when analyzing a sequence alignment, when making a homology model and analyzing model quality, when planning mutations in your protein or when analyzing the interactions of your ligand with a protein. The most common type of secondary structure in proteins is the alpha-helix. Linus Pauling was the first to predict the existence of the alpha-helix, which was confirmed with the determination of the first three-dimensional structure of myoglobin by John Kendrew and co-workers. An example of an alpha-helix is shown on the left figure below.

This type of representation of a protein structure is called sticks representation. To give you a better impression of what a helix looks like, I removed the amino acid side chains in this figure. Notice how all the carbonyl oxygen atoms (shown in red) point in one direction, towards the C-terminal end of the helix. To be more exact, they point towards the amide nitrogen of the amino acid which is 4 residues further along in the sequence. Together they form a hydrogen bond, one of the main factors in the stabilization of secondary structure in proteins.
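The "4 residues away" rule can be written down as a simple pairing scheme. The following Python sketch lists, for an idealized alpha-helix, which carbonyl oxygen (residue i) pairs with which amide nitrogen (residue i+4). It is only bookkeeping on residue numbers: real helices are irregular at their ends, and the function name and the residue range in the example are made up for illustration.

```python
from typing import List, Tuple


def helix_backbone_hbonds(first: int, last: int) -> List[Tuple[int, int]]:
    """Return (acceptor, donor) residue pairs for an idealized alpha-helix.

    The carbonyl O of residue i accepts a hydrogen bond from the
    amide N-H of residue i + 4; both residues must lie within first..last.
    """
    return [(i, i + 4) for i in range(first, last - 3)]


# A hypothetical helix spanning residues 10 to 20.
for acceptor, donor in helix_backbone_hbonds(10, 20):
    print(f"C=O of residue {acceptor} ... H-N of residue {donor}")
```

Note that the first four amide groups and the last four carbonyl groups of such a helix have no partner within the helix itself, which is one reason why helix ends often interact with side chains or water.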

On the right the same alpha-helix is shown, but with dashed lines representing the hydrogen bonds. For a hydrogen bond to be formed, two electronegative atoms (in the case of an alpha-helix the amide N and the carbonyl O) have to interact with the same hydrogen. The hydrogen is covalently attached to one of the atoms (called the hydrogen-bond donor, here N), but interacts electrostatically with the other (the hydrogen-bond acceptor, here O). In proteins almost all groups capable of forming H-bonds (both main chain and side chain atoms), independently of whether the residue is within a secondary structure element or some other type of structure, are usually H-bonded to each other or to water molecules. Due to their electronic structure, water molecules may accept 2 hydrogen bonds and donate 2, thus being simultaneously engaged in a total of 4 hydrogen bonds. Water may also be involved in the stabilization of secondary structure. It is useful to remember that the energy of a hydrogen bond, depending on the distance between the donor and the acceptor and the angle between them, is in the range of 2-10 kcal/mol. Hydrogen bonds also stabilize another type of secondary structure in proteins, namely the beta-sheet. An example of a beta-sheet with the stabilizing hydrogen bonds is shown in the figure below:

As you may see from this figure, the hydrogen bonds in this case are between different stretches of the structure. In other words, they are not formed between residues close to each other in the sequence, as in the case of an alpha-helix. Rather, different stretches of the amino acid sequence of the protein form a beta-sheet. Each stretch of sequence in a beta-sheet is called a beta-strand. Thus, a beta-sheet consists of several beta-strands, held together by a network of hydrogen bonds. The same beta-sheet is shown in the figure below, this time in a so-called "ribbon" representation and in the context of the protein structure to which it belongs (the protein is colored according to secondary structure, with beta-sheets in yellow and helices in magenta). The arrows show the direction of each beta-strand, which runs from the N-terminus to the C-terminus. When the arrows of neighboring strands point in the same direction, we call such a beta-sheet parallel, and when they point in opposite directions, the beta-sheet is called anti-parallel.
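The arrow picture translates directly into a small calculation: if each strand is represented by a vector from its N-terminal end to its C-terminal end, the sign of the dot product of two such vectors tells us whether the strands run the same way. The Python sketch below makes this concrete. It is a simplification that assumes each strand is described by just two points (for example the first and last C-alpha positions) and that the strands are already known to be neighbors in a sheet; the coordinates are invented.

```python
from typing import Tuple

Point = Tuple[float, float, float]
Strand = Tuple[Point, Point]  # (N-terminal end, C-terminal end)


def direction(strand: Strand) -> Point:
    """Vector pointing from the N-terminal to the C-terminal end of a strand."""
    (x0, y0, z0), (x1, y1, z1) = strand
    return (x1 - x0, y1 - y0, z1 - z0)


def strand_orientation(strand_a: Strand, strand_b: Strand) -> str:
    """Classify two neighboring strands by the sign of their dot product."""
    da, db = direction(strand_a), direction(strand_b)
    dot = sum(a * b for a, b in zip(da, db))
    if dot > 0:
        return "parallel"
    if dot < 0:
        return "anti-parallel"
    return "undetermined"  # perpendicular or zero-length vectors


# Invented coordinates: two strands side by side, about 4.8 units apart.
strand_1 = ((0.0, 0.0, 0.0), (0.0, 0.0, 17.0))
strand_2 = ((4.8, 0.0, 17.0), (4.8, 0.0, 0.0))  # runs the opposite way
strand_3 = ((9.6, 0.0, 0.0), (9.6, 0.0, 17.0))  # runs the same way as strand_1

print(strand_orientation(strand_1, strand_2))
print(strand_orientation(strand_1, strand_3))
```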

In the next figure you can see an example of a protein structure with an anti-parallel beta-sheet. Notice that the arrows point in different directions in this case.

When we have just 2 beta-strands anti-parallel to each other, as in the figure below, we call this secondary structure a beta-hairpin. The part between the two beta-strands is called a loop. Loops are considered to be one of the secondary structure types in proteins. They play an important role, connecting beta-strands to each other, strands to alpha-helices, or helices to each other. The loop sequence within a particular protein family may be very variable. For this reason, as discussed in the sequence alignment and homology modeling parts, when aligning homologous sequences we try to localize insertions and deletions to loop regions. There are different types of loops, depending on the types of amino acids there and on the torsion angles within the loop, but for our purposes we don't need to go into these details now.

A beta-hairpin.

You may always return to the summary page of the protein structures chapter if you would like to jump to some other part. In the next section I will discuss an important characteristic of the secondary structure of proteins: the torsion angles and the Ramachandran plot.

Torsion Angles and the Ramachandran Plot, Allowed and Disallowed Regions
The next part of this overview is on torsion angles, which are also called Ramachandran angles (Ramachandran GN, Ramakrishnan C, Sasisekharan V, 1963, J Mol Biol 7:95-99). The two torsion angles phi and psi describe the rotation of the polypeptide chain around the two bonds on either side of the Cα atom: phi is the rotation around the N-Cα bond and psi is the rotation around the Cα-C bond (see the figure below).
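A torsion angle is defined by four consecutive atoms. As a minimal sketch in plain Python (no structure-file parsing), the function below computes the signed angle from four sets of coordinates; phi then uses the atoms C(i-1), N(i), Cα(i), C(i) and psi uses N(i), Cα(i), C(i), N(i+1). The function names are illustrative and the coordinates in the example are made up to give easily checked angles, they are not taken from a real structure.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3) -> float:
    """Signed torsion angle in degrees (-180 to +180) about the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    s0, s2 = _dot(b0, b1), _dot(b2, b1)
    v = (b0[0] - s0 * b1[0], b0[1] - s0 * b1[1], b0[2] - s0 * b1[2])
    w = (b2[0] - s2 * b1[0], b2[1] - s2 * b1[1], b2[2] - s2 * b1[2])
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c) -> float:
    """phi of residue i: C(i-1), N(i), CA(i), C(i)."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next) -> float:
    """psi of residue i: N(i), CA(i), C(i), N(i+1)."""
    return dihedral(n, ca, c, n_next)


# Made-up coordinates: the central bond lies along the z axis.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

The sign convention used here is the usual one: looking along the central bond, a clockwise rotation from the front atom to the rear atom counts as positive.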

Different secondary structure elements have their characteristic torsion angles, which can be visualized using the Ramachandran plot (on the left plot the region marked alpha is for alpha-helices and the region marked beta is for beta-sheets):

Each dot on the Ramachandran plot shows the phi and psi values for one amino acid in the protein in question. The horizontal axis shows the phi values, while the vertical axis shows the psi values. Notice that the counting in the lower left-hand corner starts from -180 degrees and extends to +180 for both axes. This is a convenient presentation and allows clear distinction of the characteristic regions of alpha-helices and beta-sheets. The regions on the plot with the highest density of dots are the so-called allowed regions of the Ramachandran plot, also called low-energy regions. Some values of phi and psi are forbidden for an amino acid. The reason is that for some torsion angles the atoms within the polypeptide chain would come too close to each other, a so-called "steric clash", which would result in a very high energy of the system. On the Ramachandran plot such regions can be easily distinguished: for a high-quality experimental structure these regions are simply empty or almost empty. Very few amino acid residues in a protein have their torsion angles within these regions. But there are exceptions to this rule. Sometimes such values can be found, and they most probably result in some strain in the polypeptide chain. In such cases additional interactions must be present to stabilize the structure. Such residues may have functional significance and may be conserved within a protein family, which may be related to possible conformational dynamics of the structure, as suggested by Pal and Chakrabarti, 2002. Another exception to the principle of clustering around the alpha- and beta-regions can be seen on the right plot of the above figure. In this case the Ramachandran plot shows the torsion angle distribution for one single residue type, glycine. Glycine does not have a side chain and, as mentioned earlier in the discussion of the basic principles of protein structure, it provides high flexibility to the polypeptide chain.
In other words, it may adopt torsion angles which are normally not allowed for other amino acids. That is why glycines are often found in loop regions, where the polypeptide chain makes a sharp turn. This is also the reason for the high conservation of glycine residues, since turns are important for the preservation of the particular fold of the protein structure. Theoretically, the average phi and psi values for alpha-helices and beta-sheets should be clustered around -57, -47 and -80, +150, respectively. However, for real experimental structures these values were found to be different. A detailed discussion of the fine structure of the phi- and psi-angle distribution in the Ramachandran plot can be found in the work by Hovmöller et al., 2002.

The Ramachandran plot and the quality of a protein structure
In cases when a protein X-ray structure was not properly refined, and also for bad or wrong homology models, we may find torsion angles in disallowed regions of the Ramachandran plot. Usually these deviations indicate problems with the structure, which means that the Ramachandran plot may be used in assessing the quality of experimental structures or of structures built using homology modeling. The figure below shows two Ramachandran plots for the same protein structure. Red marks the low-energy regions, brown the allowed regions, yellow the so-called generously allowed regions and pale yellow the disallowed regions. The structure used to generate the plot on the left is at 2.9 Å resolution and was not properly refined (it is an old structure from the early days of protein crystallography). The structure used to generate the plot on the right is more recent and was refined against 1.8 Å resolution X-ray data. You may notice that the torsion angles on the left plot lack real clustering around the secondary structure regions and show a much wider distribution compared to the plot on the right (and also compared to the left plot in the figure above). There are also many dots in the disallowed regions on the left plot and almost none on the right (the ones which are seen are for glycine residues):

Thus, the Ramachandran plot provides valuable information on the quality of a protein structure and gives some additional insights into the relationship between the amino acid sequence and the three-dimensional protein structure. There will be further discussion of other aspects related to the quality of experimental protein structures and the quality of homology models. In the next section we will move on to discuss common motifs.
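To illustrate how such a check could be automated, here is a deliberately crude Python sketch that assigns a phi/psi pair to the alpha or beta region by its distance from the ideal values quoted earlier (-57, -47 and -80, +150). The 60-degree tolerance is an arbitrary assumption of mine, and real validation programs use empirically derived region maps rather than circles around two points, so treat this only as a way of making the idea concrete. All names and the example angles are illustrative.

```python
import math

# Ideal (phi, psi) values in degrees, as quoted in the text above.
IDEAL = {"alpha": (-57.0, -47.0), "beta": (-80.0, 150.0)}


def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, respecting the wrap at +/-180."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def nearest_region(phi: float, psi: float, tolerance: float = 60.0) -> str:
    """Return 'alpha', 'beta' or 'other' for one residue.

    The tolerance (in degrees) is an arbitrary cut-off chosen for this sketch.
    """
    best_name, best_dist = "other", float("inf")
    for name, (phi0, psi0) in IDEAL.items():
        dist = math.hypot(angular_difference(phi, phi0),
                          angular_difference(psi, psi0))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= tolerance else "other"


def fraction_outside(angles, tolerance: float = 60.0) -> float:
    """Fraction of (phi, psi) pairs that fall in neither region."""
    angles = list(angles)
    if not angles:
        return 0.0
    outside = sum(1 for phi, psi in angles
                  if nearest_region(phi, psi, tolerance) == "other")
    return outside / len(angles)


# Invented torsion angles for four residues.
example = [(-60.0, -45.0), (-120.0, 130.0), (-80.0, -170.0), (60.0, 60.0)]
print([nearest_region(phi, psi) for phi, psi in example])
print(fraction_outside(example))
```

A residue reported as "other" here is not necessarily wrong (think of glycine, or of left-handed turns); the point is only that a high fraction of such residues is a warning sign worth a closer look.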

Protein Motifs and Protein Domains


In the previous section I discussed the secondary structure of proteins, helices and sheets. The next level of protein structure can be described as the protein motif level, sometimes also called super-secondary structure. Protein structures contain combinations of secondary structure elements, helices and strands, which are connected to each other and combined in many different ways. We should remember, however, that there is a limited number of possible ways of arranging secondary structure elements into a protein motif. Here I will discuss some of the basic motifs, primarily by showing pictures. In the future you should be able to distinguish such motifs by analyzing a 3D structure, for example by using a graphics display program like Swiss-PdbViewer (DeepView) and information from the respective protein structure databases. Probably the simplest type of protein motif can be seen in a helical bundle, shown in the schematic view below:

Helix bundles are very common in protein structures and are very often found as separate domains within larger, multidomain structures. Another common protein motif is shown below. This is a parallel beta-sheet. You may notice that the connections between the strands in this case are not of the hairpin type, described earlier. Sometimes we may find long coil regions, or even an alpha-helix connecting the strands in a beta-sheet.

The alpha-helix connection in a parallel beta-sheet structure is shown below:

In the example below the strands of the beta-sheet are no longer all parallel. You may also notice that between some of the strands the connection is of the hairpin type:

Other types of connections between strands build the protein motifs shown below:

At a later stage I will finish this part with some biologically relevant examples showing a possible role of a protein motif in protein function. In the next section I will discuss the primary databases related to protein structure. And if you would like to jump to another page, the easiest would be to go back to the outline page of the protein structures chapter.

Protein folds and protein fold classification


Before going into the details of protein fold classification one could ask: why do we need to bother about this? In general terms, fold assignment is one of the basic steps in any effort towards understanding protein structure and function. Protein fold assignment will often reveal evolutionary relationships which are sometimes difficult to detect at the sequence level. In turn this may help in a better understanding of protein function, its biological activity and its role in a living organism. From the study of the relationships between sequence and structure we may also get deeper insights into the basic principles of protein structure and function; we may learn how to design new proteins with pre-defined activity, how to modify existing proteins in a direction we need, etc. I mentioned in the introduction to this chapter that the amino acid sequence determines the protein three-dimensional structure. I also mentioned that this relationship is not unique: different sequences, sometimes totally unrelated, may have similar 3D structures. In other words, the degree of conservation of the three-dimensional structure is much higher than the degree of conservation of the amino acid sequence. I discussed earlier sequence conservation and the methods we use for the assessment of the similarity between two sequences. But how do we compare protein 3D structures to find out if they are similar? And how different can similar protein structures be? Obviously, there should be some criteria which can be used to judge the degree of similarity between protein structures, an important step in any protein fold assignment. A discussion of this subject will appear shortly in the homology modeling chapter. For now I would like to switch from talking about 3D structures to talking about folds. Three-dimensional structures may sometimes differ substantially from each other and still have the same type of protein fold.
I have noticed that students sometimes have difficulties understanding what a fold is. I would define a protein fold as a certain type of arrangement of secondary structure elements in space. I have actually already mentioned some folds on the previous page on super-secondary structures (protein motifs). The 4-helix bundle, for example, is a fold. So is the TIM barrel fold of alternating helices and strands. The figures below show the coenzyme-binding domain of some dehydrogenases, one of the most common protein folds, also called the Rossmann fold. Michael G. Rossmann is a protein crystallographer who solved the first structure with this type of fold. It is also the only protein fold named after the person who was first to discover it:

In this figure, a schematic presentation of the Rossmann fold is shown on the left. On the right, the nucleotide-binding domain of liver alcohol dehydrogenase is shown. Notice the parallel beta-sheet (shown in yellow).

There are many types of protein folds of course, but how many? Taking into account the huge number of amino acid sequences, one would expect a high number of different folds. But in reality it is not like that. The number of folds is limited. Nature has re-used the same folding types again and again for performing totally new functions. Some people would refer to the common ancestor from which all other organisms have originated. However, I am not going to discuss this now, maybe sometime in the future. To find out how many folds are out there we can simply go to the Protein Data Bank (PDB) and click PDB Statistics in the upper right corner. At the end of the page which appears, under "Growth Of Unique Protein Classifications Per Year", you may click one of the following two:
As Folds Defined By SCOP
As Topologies Defined By CATH
SCOP and CATH are the two databases generally accepted as the two main authorities in the world of fold classification. According to SCOP there are 1393 different folds. Also notice in the graph that the last time a new fold was identified was 2008:

The next graph shows the folds identified by CATH database, a total of 1233 folds:

Evidently the two databases use slightly different fold definitions and protein fold classifications, which results in a different number of protein folds for the same set of protein structures. In any case, as mentioned in the outline of this chapter, knowing the protein fold is important in many cases, for example during homology modeling of a protein structure. The question now is: how do we identify a protein fold? What is the main folding unit? It is the domain, discussed on the next page.

Protein Domains and protein domain classification, the CATH database


When discussing protein folds we first need to identify the folding unit. Such a unit is called a protein domain. This means that when we talk about fold classification we actually mean DOMAIN CLASSIFICATION. Protein domains are the basic building blocks of a protein structure and they are also the basic evolutionary unit of any structure. Of course a protein may consist of a single domain or may be a multi-domain protein. Certain protein domains have some function associated with them, like the Rossmann-type domain, also called the coenzyme-binding domain, shown on the previous page. They carry this function with them when they get inserted into different proteins during evolution. A domain may be characterized by the following:
1- A spatially separated unit of the protein structure.
2- May have sequence and/or structural resemblance to another protein structure or domain.
3- May have a specific function associated with it.

To characterize the folding of protein domains, we need to discuss the details of fold classification: how folds are defined by different databases, what the relationships are between a fold and the protein family, etc. For classification we usually need to follow the scheme:
1- Assignment of secondary structure.
2- Assignment of independent folding units: Domains.
3- Assignment of a structural class.
4- Assignment of fold (also called architecture).
5- Assignment of topology (superfamily).
Secondary structure is usually assigned automatically, using specific computer programs. For example, most protein structure visualization programs will do it, and usually all PDB files contain a secondary structure assignment. For information on the fold of a protein domain we simply need to consult the CATH and SCOP databases, although one needs to be aware that CATH and SCOP use slightly different terminology in domain assignment. The name CATH comes from the first letters of Class-Architecture-Topology-Homologous superfamily. For clarity I show below some examples of proteins consisting of one or several domains:

On the left is the structure of one of the subunits of hemoglobin and on the right is the structure of pyruvate kinase. The functional units of both proteins consist of 4 subunits; in other words, they have a quaternary structure. In the case of hemoglobin this makes 4 domains, while for pyruvate kinase, with 3 domains per subunit, there are 12 protein domains in the functional unit. A subunit of hemoglobin consists of a single alpha-helical type domain. You may also see the heme molecule (in sticks representation) bound within a pocket created by the helices. The domains in pyruvate kinase are well separated from each other. The top domain in the figure is built up by beta-sheets, while the other two domains contain a mixture of helices and sheets. For illustration, the figure below shows the quaternary structure of pyruvate kinase:

Protein domains may be assigned using automatic procedures, often in combination with manual inspection. In pyruvate kinase, for example, the domains are well separated from each other, but in many cases it may be difficult to separate them visually for an untrained person. In such cases the easiest would be to consult the CATH database, which gives a clear definition of the domains. For example, when I perform a search with the PDB ID I am using (1e0t) for pyruvate kinase, I get the following result:

The protein domains are organized in rows in the table above. There are 4 subunits (4 separate polypeptide chains) in the quaternary structure, and that is why we see in the table the designations A, B, C and D, called chain identifiers. For example, in 1e0tA01 we first have the PDB entry code (1e0t), followed by the chain identifier (A) and the domain number (01), as numbered by the database. You may also notice that there are 3 domains: 01, 02 and 03. A CATH-generated ribbon representation of the structure of the 3 protein domains is shown below:

There is also a table telling us which amino acid residues each domain consists of (start PDB residue - stop PDB residue), and a schematic presentation of the domain composition. This information is very valuable, for example when you make a homology model of a multidomain protein.
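The domain identifiers described above (for example 1e0tA01) are easy to take apart programmatically. The Python sketch below assumes exactly the layout explained in the text: a four-character PDB entry code, a one-character chain identifier and a two-digit domain number. The regular expression, the class and the function names are my own, and identifiers that deviate from this layout are simply rejected.

```python
import re
from typing import NamedTuple


class DomainId(NamedTuple):
    pdb_code: str
    chain: str
    domain: int


# Assumed layout: 4-character PDB code, 1-character chain, 2-digit domain number.
_DOMAIN_ID = re.compile(r"^([0-9][A-Za-z0-9]{3})([A-Za-z0-9])([0-9]{2})$")


def parse_domain_id(identifier: str) -> DomainId:
    """Split an identifier such as '1e0tA01' into its three parts."""
    match = _DOMAIN_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognized domain identifier: {identifier!r}")
    pdb_code, chain, domain = match.groups()
    return DomainId(pdb_code.lower(), chain, int(domain))


# The pyruvate kinase example: 4 chains (A-D) with 3 domains each.
identifiers = [f"1e0t{chain}{number:02d}" for chain in "ABCD" for number in (1, 2, 3)]
domains = [parse_domain_id(text) for text in identifiers]
print(len(domains), "domains in the functional unit")
print(domains[0])
```

Grouping the parsed entries by chain is then a convenient starting point when you need the domain boundaries of one subunit, for instance as input for homology modeling of a multidomain protein.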

To keep the length of this page within reasonable Web limits, I will continue the discussion of protein domains on the following page. As usual, you may always go back to the outline of the protein structures chapter if you want to jump to some other page.

Introduction to protein databases


Which protein database should I use? Where do I find a protein structure I am interested in? There are many protein databases on the Internet. Some of them are of a general character, while others are dedicated to specific protein families, specific metabolic pathways, etc. Here I will discuss some general-character databases. The first question which may arise when working with a protein structure could be something like: where can I find a protein structure I am interested in? Another question, which many people don't ask once they get access to a protein structure file, is: what is actually inside that file? The last question is essentially about how to put all the beautiful 3D structures with all their helices, strands, loops, etc. into a single file. The main protein database for protein structure information is the Protein Data Bank, created in the early 1970s. Believe it or not, databases already existed at that time, even if it almost feels like the Stone Age. Only a few structures existed in the Protein Data Bank (PDB) at that time, and the only experimental method available for protein structure determination was protein X-ray crystallography. The real structural revolution, as you may see from the figure below, started in the early 1990s:

Why then? One of the reasons was that cloning techniques started to enter the lab and the amount of protein available for crystallization increased substantially. Before the cloning era people had to purify proteins from cells, and the cells obviously did not have the need to express large quantities of the proteins we needed for crystallization. Therefore, to obtain a few milligrams of protein for crystallization one needed a huge amount of cells. Cloning solved the problem. Another important factor was the introduction of synchrotron radiation. Synchrotrons, like MAX Lab in Lund, Sweden, ESRF in Grenoble, France, or DESY in Hamburg, Germany, provide very high intensity X-rays, which may be used for collecting high-quality X-ray diffraction data from crystals. The third factor was probably the introduction of personal computers, relatively cheap and with ever increasing power. As usual Mac was of course first, but the cheaper DOS-based machines took over the market very soon, especially after the introduction of Windows in the early 1990s. And cheaper computers mean new software. That was when the number of protein structures started to increase dramatically. Then came the era of structural genomics: large consortia were formed with the aim of developing new technology for solving huge numbers of protein structures. One such consortium is the SGC. With a larger number of structures available, the number of protein databases started to increase, and new tools for the analysis of protein sequence and structure were rapidly developed. Currently every new structure has to be deposited with the Protein Data Bank in order for the research group to be able to publish the paper based on the structure. So, how do we get a structure and what is inside the PDB file? Getting a structure is very easy; all you need to do is to go to the PDB and type in the name of the structure you are looking for:

I typed in the name of a protein I know, magnesium chelatase. We should have got one single hit, since there is only one X-ray structure of a magnesium chelatase (actually of one of its subunits). However, several other hits are listed, and they have very little to do with magnesium chelatase. This is something I don't like about the Protein Data Bank. For this type of search I prefer to use PDBsum or PDBe (PDB Europe). When I type in the name of the same protein there, I get a single hit. The search function in the original PDB does not seem to work properly; however, they have great educational material on using the Protein Data Bank. I recommend having a look at it; you will find the link to the educational material at the bottom of the sidebar menu. Let us use PDBsum as the entrance to our exploration of interesting protein databases. The main page looks something like this:

And the search result I get from typing "magnesium chelatase" into the text search area looks like this:

All I need to do now is to click on 1g8p on the left; this is the PDB code for this particular structure. I use the same structure in the homology modeling tutorial. What we get is this:

But this is just the top of the page. If you click here, you will get the page with all the links to other protein databases containing information related to this structure, information about the article in which the structure was published, references that cite that publication, etc. I like it, and I think it is wonderful to be able to squeeze so much information into a single page.

Here you can see the secondary structure along the amino acid sequence, which is important information when, for example, doing amino acid sequence alignment and homology modeling. On the right side you will find links to several protein databases:

Among these are CATH and SCOP for protein domain classification, Proteopedia, which provides a lot of detailed information, and the Electron Density Server (EDS), which can be used to view the electron density into which the protein structure was built.

The Protein Databank (PDB) Coordinate File


I will continue the discussion of protein databases on this page. Now that we are acquainted with the Protein Data Bank (PDB) and PDBsum, it is time to retrieve the coordinate file of the structure and look inside it. It is called a "coordinate file" simply because it contains a list of all the atoms of the protein (at least the ones visible in the electron density map calculated from the X-ray data), each with its x,y,z coordinates in a conventional orthogonal coordinate system. I will go into the details later. We can retrieve the file, save it on the hard drive of the computer and then use it any way we like. For example, we can read it into the SwissPDB Viewer graphics software and examine the structure. If we started our search using PDBsum, we can use the links panel on the right-hand side and choose PDB or PDBe. I personally prefer PDBe, the European version of the Protein Data Bank. Assuming that we are still working with the PDB entry from the previous page, what we get after clicking on the PDBe link at PDBsum is:

The entry page provides a lot of important information on the structure, as in the table above. For example, the second row tells us that the structure was solved by protein crystallography, and the third row gives important data on the quality of the structure, resolution (described here) being the most important parameter. The R-factor is another essential parameter for assessing X-ray structure quality. It tells us how well the structure fits the experimental data: the lower the R-factor, the better the agreement. Normally, refined protein structures have R-factor values below 25%. There is also a link to the publication in which the structure was first described, and a list of keywords describing the structure: parallel beta-sheet, P-loop, Rossmann fold (mentioned here) and photosynthesis, which refers to the function of the protein. To download the file with atomic coordinates from the Protein Data Bank, we need to click the link in the right-hand side menu, and we will get the following choices:
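Before moving on, the R-factor mentioned above can be made a little more concrete. The conventional definition compares observed and calculated structure factor amplitudes: R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs|. The short Python sketch below is only an illustration of that formula; the function name and the amplitudes are made up and do not come from any real data set.

```python
def r_factor(f_obs, f_calc):
    """Conventional crystallographic R-factor.

    R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|)
    Both arguments are sequences of structure factor amplitudes
    for the same reflections, in the same order.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    if denominator == 0:
        raise ValueError("sum of observed amplitudes is zero")
    return numerator / denominator


# Made-up amplitudes, for illustration only.
f_obs = [100.0, 50.0, 25.0, 25.0]
f_calc = [90.0, 55.0, 20.0, 30.0]

print(f"R = {r_factor(f_obs, f_calc):.1%}")  # prints "R = 12.5%"
```

With these toy numbers the model disagrees with the "data" by 12.5%, which would be comfortably below the 25% mentioned above.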

If we click the very first link, which says PDB, we will get the coordinates in a new browser window. This can be useful for examining the content of the file; however, to download the file we need to right-click (on a Windows PC) or control-click (on a Mac; you may also right-click on a Mac if you have a proper mouse) and choose "save as" from the menu. After that you can use the file for whatever purpose you want, for example to examine it with the SwissPDB Viewer. Now, let us have a look inside the file. It is a normal text file, although formatted in a special way: each field of a record, like the name of the amino acid, the x,y,z coordinates of the atoms, etc., is positioned at a fixed distance (number of characters) from the left margin. The programs which read the file know exactly where each field is located. This is an old way of arranging information; however, it is still used in protein crystallography and in Protein Data Bank files. The file begins with some information on the protein, the depositors, the organism it comes from, etc.:
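As an aside, the download step described above does not have to be done by hand in the browser. The sketch below shows one way to do it from Python. It assumes that the RCSB file server offers entries at addresses of the form https://files.rcsb.org/download/XXXX.pdb; that URL pattern, and the function names pdb_download_url and fetch_pdb, are my assumptions for this illustration, so check the download links on the entry page if it does not work.

```python
import urllib.request


def pdb_download_url(pdb_id):
    """Build a download URL for a four-character PDB code.

    The URL pattern is an assumption about the RCSB file server layout.
    """
    pdb_id = pdb_id.strip()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError("a PDB code has four alphanumeric characters")
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def fetch_pdb(pdb_id, out_path):
    """Download the coordinate file and save it as plain text (needs network access)."""
    with urllib.request.urlopen(pdb_download_url(pdb_id)) as response:
        text = response.read().decode("ascii", errors="replace")
    with open(out_path, "w") as handle:
        handle.write(text)
    return out_path


print(pdb_download_url("1g8p"))  # prints "https://files.rcsb.org/download/1G8P.pdb"
```

Calling fetch_pdb("1g8p", "1g8p.pdb") would then save the file in the current directory, ready to be opened in a text editor or a graphics program.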

Next, you will find information on the X-ray experiment, like the resolution of the X-ray structure. I will explain some of these parameters in the X-ray chapter of this course. If we scroll down, we will get to the description of the secondary structure elements of the protein in question:
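Returning for a moment to the experimental information: in PDB-format files the resolution is normally given in a REMARK 2 line. The sketch below shows how a program might pick it out. The sample lines are hand-written for illustration, and the value 2.10 is a placeholder, not the resolution of any particular entry.

```python
import re

# Matches lines such as "REMARK   2 RESOLUTION.    2.10 ANGSTROMS."
RESOLUTION_RE = re.compile(
    r"^REMARK\s+2\s+RESOLUTION\.\s+([0-9]+\.[0-9]+)\s+ANGSTROMS"
)


def read_resolution(lines):
    """Return the resolution in angstrom as a float, or None if not found."""
    for line in lines:
        match = RESOLUTION_RE.match(line)
        if match:
            return float(match.group(1))
    return None


# Hand-written header lines with a placeholder value.
sample_header = [
    "HEADER    HYPOTHETICAL EXAMPLE",
    "REMARK   2",
    "REMARK   2 RESOLUTION.    2.10 ANGSTROMS.",
]

print(read_resolution(sample_header))  # prints 2.1
```

Returning None rather than raising an error is a deliberate choice here, since files from methods other than X-ray crystallography may not carry a numeric resolution.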

The text here simply tells us that, for example, helix 1 starts at Pro A 22 and ends at Ile A 26, helix 2 starts at Gln A 29 and ends at Asp A 42, etc. A similar description follows for the beta-sheets and the strands within them. Scrolling further down will show us how the actual coordinates of the atoms are stored in the Protein Data Bank file:
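To illustrate how a program reads the helix description above, here is a small sketch that parses HELIX records by column position. The two sample lines are hand-built from the residues named in the text, using the column layout of the PDB format as I understand it (residue names in columns 16-18 and 28-30, chain identifiers in columns 20 and 32, residue numbers in columns 22-25 and 34-37). They are truncated and are not copied from the actual file.

```python
from typing import NamedTuple


class Helix(NamedTuple):
    serial: int
    start_res: str
    start_chain: str
    start_num: int
    end_res: str
    end_chain: str
    end_num: int


def parse_helix(line):
    """Read the start and end residue of a helix from fixed columns."""
    return Helix(
        serial=int(line[7:10]),          # columns 8-10
        start_res=line[15:18].strip(),   # columns 16-18
        start_chain=line[19],            # column 20
        start_num=int(line[21:25]),      # columns 22-25
        end_res=line[27:30].strip(),     # columns 28-30
        end_chain=line[31],              # column 32
        end_num=int(line[33:37]),        # columns 34-37
    )


# Hand-built, truncated records based on the helices described in the text.
records = [
    "HELIX    1   1 PRO A   22  ILE A   26",
    "HELIX    2   2 GLN A   29  ASP A   42",
]

for record in records:
    h = parse_helix(record)
    length = h.end_num - h.start_num + 1
    print(f"helix {h.serial}: {h.start_res} {h.start_chain} {h.start_num} "
          f"to {h.end_res} {h.end_chain} {h.end_num} ({length} residues)")
```

Note that the parser never splits on spaces; it relies only on the character positions, which is exactly why the format is so strict about them.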

First of all, notice that the structure starts from amino acid Arg 18! There are no amino acids from 1 to 17. What happened? They are absent because there was no electron density for these residues, which is normally due to high flexibility of this region of the structure. Michel, who was my student at that time and built this structure, could not identify the correct positions for these amino acids, and they were never built into the structure. You need to be aware that many structures in the PDB have missing parts, sometimes in loop regions, sometimes just side chains, and in the worst cases a whole domain may be missing. The number after the record name, ATOM, is simply the sequential number of the atom in the structure. This is followed by the atom name. For example, CA means C-alpha, the carbon atom to which the side chain of the amino acid is attached. The next carbon atom is C-beta, and the following atoms are named after the Greek alphabet: gamma, delta, etc. Except for C-alpha, main chain atoms do not have any Greek letters attached to them. They are just C, O and N. After the atom name you will recognize the name of the amino acid, followed in this file by the letter A. This is the so-called chain identifier. When the structure consists of several polypeptide chains, each gets its own chain identifier, like A, B, C, etc. Next comes the residue number. The three numbers which follow (e.g., 14.699, 61.369, 62.050 for the very first atom) are the actual x,y,z coordinates of the atom, given in ångströms. They describe the position of each atom in a conventional (for the Protein Data Bank) orthogonal coordinate system. If we can describe the position of each atom in the protein, we will obviously be able to draw the whole tertiary structure.
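The ATOM record just described can be read by a program in the same fixed-column way. The sketch below is a minimal illustration, not a complete PDB parser. The sample line is hand-built: the residue (Arg A 18) and the coordinates come from the text above, while the atom name N, the occupancy of 1.00 and the B-factor of 50.00 are placeholders I have assumed for the example.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    alt_loc: str
    res_name: str
    chain: str
    res_num: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom(line):
    """Slice one ATOM record into its fields by column position."""
    return Atom(
        serial=int(line[6:11]),          # columns 7-11
        name=line[12:16].strip(),        # columns 13-16
        alt_loc=line[16].strip(),        # column 17, usually blank
        res_name=line[17:20].strip(),    # columns 18-20
        chain=line[21],                  # column 22
        res_num=int(line[22:26]),        # columns 23-26
        x=float(line[30:38]),            # columns 31-38
        y=float(line[38:46]),            # columns 39-46
        z=float(line[46:54]),            # columns 47-54
        occupancy=float(line[54:60]),    # columns 55-60
        b_factor=float(line[60:66]),     # columns 61-66
    )


# Hand-built line; atom name, occupancy and B-factor are placeholders.
sample_atom = "ATOM      1  N   ARG A  18      14.699  61.369  62.050  1.00 50.00"

atom = parse_atom(sample_atom)
print(atom.res_name, atom.chain, atom.res_num, (atom.x, atom.y, atom.z))
```

Once every ATOM line has been read this way, finding missing residues such as 1 to 17 is just a matter of looking for gaps in the residue numbers.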
A graphics program, when it reads the coordinates from the Protein Data Bank file, simply connects the atoms to each other according to some distance criteria, which are often among the default parameters of the program, thus creating the graphics view we are accustomed to. For example, we know that the C-C single bond distance is about 1.54 Å, and this can be used to connect two carbon atoms to each other. The x,y,z coordinates are followed by a number, which is 1 in most cases. This is called the atom occupancy. Sometimes the side chain of a particular amino acid, or even main chain atoms, may have two or more different conformations. These can be distinguished in the electron density map of the structure. In this case the crystallographer will build both conformations into the electron density and refine a parameter called occupancy for each conformation. In Protein Data Bank files these are called "alternative conformations", and each is marked with an alternate location indicator, a letter such as A or B placed just before the residue name. The occupancy numbers for each alternative conformation will be less than 1 (1 corresponds to 100% occupancy); for example, they may be 0.5/0.5 (50/50) when both conformations are equally occupied, or 0.4/0.6, or some other numbers. The numbers in the last column of the PDB file are the temperature factors, or B-factors, of the atoms in the structure. The B-factor describes the displacement of an atom from its average (mean) position. The more flexible an atom is, the larger its displacement from the mean position will be (mean-square displacement). In graphics programs we can often color a protein according to B-factor value. Usually the areas with high B-factors are colored red (hot), while those with low B-factors are colored blue (cold). An inspection of a Protein Data Bank structure with such a coloring scheme will immediately reveal regions with high flexibility in the tertiary structure of the protein. This concludes my brief overview of the PDB coordinate file.
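Two of the ideas above are easy to try out in a few lines of Python: connecting atoms by a distance criterion, and converting a B-factor into a displacement. This is only a sketch. The 1.9 Å cutoff and the three atom positions are made up for the example, real graphics programs use more refined criteria, and the conversion assumes the standard isotropic relation B = 8π²⟨u²⟩. It needs Python 3.8 or later for math.dist.

```python
import math
from itertools import combinations


def find_bonds(atoms, max_dist=1.9):
    """Pair up atoms that lie within max_dist angstrom of each other.

    atoms is a list of (name, x, y, z) tuples. The default cutoff is a
    hypothetical value, a little longer than a C-C single bond (1.54).
    """
    bonds = []
    for (name1, *pos1), (name2, *pos2) in combinations(atoms, 2):
        if math.dist(pos1, pos2) <= max_dist:
            bonds.append((name1, name2))
    return bonds


def rms_displacement(b_factor):
    """Root-mean-square displacement in angstrom from an isotropic B-factor.

    Uses B = 8 * pi**2 * <u**2>, so sqrt(<u**2>) = sqrt(B / (8 * pi**2)).
    """
    return math.sqrt(b_factor / (8 * math.pi ** 2))


# Made-up coordinates for three side chain carbons, for illustration only.
atoms = [
    ("CA", 0.00, 0.00, 0.00),
    ("CB", 1.54, 0.00, 0.00),
    ("CG", 2.50, 1.20, 0.00),
]

print(find_bonds(atoms))                 # prints [('CA', 'CB'), ('CB', 'CG')]
print(f"{rms_displacement(20.0):.2f}")   # prints 0.50
```

So a B-factor of 20 corresponds to an atom that is displaced by about half an ångström from its mean position, which gives a feeling for what the red and blue colors in a B-factor display actually mean.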
I have made a summary of links to the protein databases discussed on the site, along with many additional databases, which may help you explore other features associated with structural data. You may also return to the outline of the protein structures if you need to jump to another content page.