BIOINFORMATICS END-TERM PROJECT

ASHWIN SAXENA
(2017B1A40453P)

About LIN28B Gene -


LIN28B (Lin-28 Homolog B) is a protein-coding gene in humans. The protein encoded by this gene belongs to the lin-28 family,
which is characterized by the presence of a cold-shock domain and a pair of CCHC zinc finger domains. This gene is
highly expressed in testis, fetal liver, placenta, and in primary human tumors and cancer cell lines. It is negatively regulated
by microRNAs that target sites in the 3' UTR, and overexpression of this gene in primary tumors is linked to the repression of
the let-7 family of microRNAs and derepression of let-7 targets, which facilitates cellular transformation.

Analysis of the sequence data obtained was performed using Python.

Results obtained:

1. %GC content = 36.44%
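
The %GC figure above can be computed along the following lines. This is a minimal sketch, not the script used for the project: the function name `gc_content` and the example fragment are hypothetical, and it assumes the sequence is available as a plain string of bases.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C among the unambiguous bases A, C, G, T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    # Ambiguous symbols such as N are ignored; an empty sequence gives 0.0
    return 100.0 * gc / total if total else 0.0


# Hypothetical fragment used only to show the call; it is not the LIN28B sequence
fragment = "ATGGCCGAAGGCGGGGCTAGCAAAGGTGGT"
print(f"%GC content = {gc_content(fragment):.2f}%")
```

Running the same function on the full downloaded sequence should give the value reported in the results, provided ambiguous bases are treated the same way.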

Amino Acid       Single Letter Code    Frequency
Alanine          A                     4.8%
Isoleucine       I                     2.8%
Leucine          L                     4.4%
Valine           V                     4.4%
Phenylalanine    F                     3.6%
Tryptophan       W                     0.4%
Tyrosine         Y                     0.79%
Asparagine       N                     2.00%
Cysteine         C                     4.00%
Glutamine        Q                     5.20%
Methionine       M                     2.40%
Serine           S                     10.80%
Threonine        T                     6.00%
Arginine         R                     6.00%
Histidine        H                     3.20%
Lysine           K                     9.20%
Aspartic Acid    D                     1.60%
Glutamic Acid    E                     9.60%
Glycine          G                     11.60%
Proline          P                     9.60%
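
A frequency table of this kind can be produced with a short counting routine. The sketch below is illustrative only: `amino_acid_frequencies` is a hypothetical helper, the example peptide is made up, and the sequence is assumed to be a one-letter-code string.

```python
from collections import Counter
from typing import Dict

STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(protein: str) -> Dict[str, float]:
    """Return the percentage of each of the 20 standard amino acids in a protein."""
    protein = protein.upper().replace("*", "")  # drop a trailing stop symbol if present
    counts = Counter(protein)
    length = len(protein)
    if length == 0:
        return {aa: 0.0 for aa in STANDARD_AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / length for aa in STANDARD_AMINO_ACIDS}


# Hypothetical peptide, not the LIN28B protein
freqs = amino_acid_frequencies("MAEGGASKGG")
for aa, pct in sorted(freqs.items(), key=lambda item: -item[1])[:5]:
    print(f"{aa}  {pct:.2f}%")
```

Sorting the result by percentage, as in the loop above, gives the ranking of the most abundant residues directly.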

Inferences -

The presence of hydrophilic amino acids like serine, arginine, glutamic acid, etc. supports the role of the protein as a
secretory product of endothelial cells.

The presence of some hydrophobic amino acids like proline suggests that the core is hydrophobic.

Cysteine is also present in a sufficient amount to allow for the formation of disulfide bridges.

Secondary structure prediction of the protein using different web servers: GOR IV, JPRED, PHD, PREDATOR,
PredictProtein and PSIPRED.

1. GOR IV
2. JPRED
3. PHD
4. PREDATOR
5. PREDICT PROTEIN
6. PSIPRED

3D VISUALISATION USING iCn3D

Rendering styles shown: Ribbon, Lines, Cylinder and Plate, C Alpha Trace, B-Factor Tube, Backbone, Schematic, Sphere,
Sticks, Strand, Balls and Sticks.


CONCLUSIONS –

1. Across all the web servers, the majority of residues in the predicted secondary structure are random coil,
averaging slightly more than 50%.
2. In all the predicted structures, random coil is followed by extended strand, averaging close to
40%.
3. Among the constituent amino acids, the five most abundant in decreasing order are Glycine (G), Serine (S),
Glutamate (E), Proline (P) and Lysine (K).
4. Some consensus can be drawn: the first eight amino acids are random coil, and the residues near the end of the
sequence are also random coil.
5. Among all the servers, PREDICTPROTEIN gave the best and clearest prediction.
6. From the images provided above, we see that the protein does indeed contain a majority of alpha-helix
structures and coils.
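
The percentages and the consensus mentioned in the conclusions can be tallied as sketched below. This is an assumption-laden illustration: it presumes each server's output has first been converted by hand to one string over the three-state alphabet H (helix), E (extended strand) and C (coil), and the function names and toy strings are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def state_composition(prediction: str) -> Dict[str, float]:
    """Percentage of helix (H), strand (E) and coil (C) in one prediction string."""
    counts = Counter(prediction.upper())
    length = len(prediction)
    return {state: 100.0 * counts[state] / length for state in "HEC"}


def consensus(predictions: List[str]) -> str:
    """Per-residue majority vote over equal-length predictions; ties fall back to coil."""
    result = []
    for column in zip(*predictions):
        counts = Counter(state.upper() for state in column)
        best = max(counts.values())
        winners = [state for state, n in counts.items() if n == best]
        result.append(winners[0] if len(winners) == 1 else "C")
    return "".join(result)


# Toy predictions for a 10-residue stretch, not actual server output
toy = ["CCCEEEHHCC", "CCEEEEHHHC", "CCCEEHHHCC"]
for p in toy:
    print(p, state_composition(p))
print("consensus:", consensus(toy))
```

Averaging the per-server compositions gives figures comparable to those in conclusions 1 and 2, and the consensus string shows where the servers agree.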

DOT Plot Analysis


Self dot plots were made using the EMBOSS dotmatcher program. Screenshots and inferences are presented below.
1. Self DOT Plot of DNA sequence

2. Self DOT Plot of Protein Sequence
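
The idea behind a self dot plot can be sketched in a few lines. Note the simplification: EMBOSS dotmatcher scores each window with a substitution matrix, whereas this sketch only counts identical positions in a window. The function names, window size and threshold are assumptions chosen for illustration.

```python
from typing import Set, Tuple


def dot_plot(seq_a: str, seq_b: str, window: int = 3, threshold: int = 3) -> Set[Tuple[int, int]]:
    """Return the (i, j) window starts where at least `threshold` of `window` positions match."""
    a, b = seq_a.upper(), seq_b.upper()
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(dots: Set[Tuple[int, int]], n_rows: int, n_cols: int) -> str:
    """Draw the dots as a text grid, one row per position in the first sequence."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(n_cols))
        for i in range(n_rows)
    )


# Toy sequence containing a repeat; a self plot shows the main diagonal plus off-diagonal runs
seq = "ACGTACGT"
dots = dot_plot(seq, seq, window=4, threshold=4)
print(render(dots, len(seq) - 3, len(seq) - 3))
```

In a self plot the main diagonal is always present; additional diagonal runs off the main diagonal indicate internal repeats.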


NCBI BLAST Analysis
I. Using DNA sequence data

1) For prokaryotes

No significant results were found with tBLASTn.

A cDNA sequence was generated using the tool Reverse Translate, with the codon usage parameter set for Escherichia
coli. This cDNA sequence was then fed into the BLASTn algorithm. No significant results were found for prokaryotes
here either.
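
The reverse-translation step can be pictured as a lookup of one codon per amino acid. The sketch below is not the Reverse Translate tool itself: the table holds one codon per residue that is assumed here to be commonly used in E. coli, and it should be checked against a published codon usage table before any real use. The names `ECOLI_CODONS` and `reverse_translate` are hypothetical.

```python
# One codon per amino acid; the choices are an illustrative assumption,
# not values taken from the Reverse Translate tool
ECOLI_CODONS = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
    "*": "TAA",
}


def reverse_translate(protein: str) -> str:
    """Build a DNA sequence that encodes `protein`, using one fixed codon per residue."""
    codons = []
    for residue in protein.upper():
        if residue not in ECOLI_CODONS:
            raise ValueError(f"no codon defined for residue {residue!r}")
        codons.append(ECOLI_CODONS[residue])
    return "".join(codons)


# Hypothetical peptide used only to show the call
print(reverse_translate("MAEGK"))
```

Because the genetic code is degenerate, the resulting DNA is only one of many sequences encoding the same protein, which is one reason a BLASTn search with it may find fewer hits than a protein-level search.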

2) For Eukaryotes: BLASTn was performed using the DNA sequence to find the top 20 eukaryotic genes. 20
eukaryotes were considered because no prokaryote hits were obtained.

S. no.  Species                     Total Score  Query cover  E-value  %identity  Accession Number
1.      Pan troglodytes             9888         100%         0        99.47      NM_001004317
2.      Gorilla                     9579         97%          0        99.36      XP_018885752.2
3.      Pongo abelii                9339         97%          0        98.55      XM_024249173.1
4.      Trachypithecus francoisi    8973         100%         0        96.42      XM_033234721
5.      Rhinopithecus roxellana     8922         100%         0        96.27      XM_010352514
6.      Piliocolobus tephrosceles   8913         100%         0        96.22      XM_023205833
7.      Theropithecus gelada        8730         97%          0        96.40      XM_025382883
8.      Macaca fascicularis         8696         97%          0        96.31      XM_015449029
9.      Panthera pardus (leopard)   1781         25%          0        89.90      XM_019442981
10.     Myotis davidii              2091         49%          0        83.36      XM_015572723

II. For protein sequence data

1) For prokaryotes

No significant result was found.

2) For Eukaryotes: BLASTp was performed using the protein sequence data of the human gene LIN28B to find the top 10
eukaryote hits from different species. Ten eukaryotes were considered because no prokaryotic hits were obtained.
S. no.  Species                     Total Score  Query cover  E-value  %identity  Accession Number
1.      Pan troglodytes             503          100%         8e-180   99.60      NM_001004317
2.      Gorilla                     501          97%          3e-179   99.20      XP_018885752.2
3.      Pongo abelii                501          97%          6e-179   99.20      XM_024249173.1
4.      Trachypithecus francoisi    488          100%         5e-174   97.60      XM_033234721
5.      Rhinopithecus roxellana     481          100%         6e-171   96.39      XM_010352514
6.      Elephantulus edwardii       464          100%         3e-164   91.60      XP_006881247
7.      Orcinus orca                449          98%          6e-158   91.09      XP_033274477
8.      Macaca fascicularis         490          97%          1e-174   98.00      XM_015449029
9.      Panthera pardus (leopard)   449          96%          6e-158   91.09      XM_019442981
10.     Myotis davidii              471          98%          7e-167   95.60      XM_015572723

Inferences:

• From the BLASTP and BLASTN results we observe that the closest relatives (sequence-wise) of the human gene
LIN28B are from the apes and monkeys, as should be expected. The next closest relatives are other mammals.

• The high sequence similarity (>90%) obtained through BLASTP indicates that the protein product of LIN28B is highly
conserved across the mammalian species in the hit list.

Pairwise Global Alignment using Dynamic Programming


EMBOSS Needle (default parameters) was used to align both the DNA and the protein sequences obtained from
the BLAST search, using the Needleman-Wunsch dynamic programming algorithm. A screenshot of a sample output is
given below.

Protein –
DNA-

S. no.  Species                     Score     %identity  Accession Number
1.      Pan troglodytes             30257.5   98.20      NM_001004317
2.      Gorilla                     29550     97.50      XP_018885752.2
3.      Pongo abelii                25907     95.6       XM_024249173.1
4.      Trachypithecus francoisi    29175     95.40      XM_033234721
5.      Rhinopithecus roxellana     29527     95.20      XM_010352514
6.      Piliocolobus tephrosceles   29104     95.10      XM_023205833
7.      Theropithecus gelada        28759     94.60      XM_025382883
8.      Macaca fascicularis         28697     94.30      XM_015449029
9.      Panthera pardus (leopard)   6975      22.4       XM_019442981
10.     Myotis davidii              12739     43.5       XM_015572723

Higher similarity is obtained for the longer sequences because they allow for a greater number of matching
bases.

Both the DNA and the protein sequence data suggest that the sequences from apes and other primates are most closely
related to the human gene LIN28B.
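
The dynamic programming behind the global alignments above can be sketched as follows. This is a simplified illustration rather than a reproduction of EMBOSS Needle: it assumes a fixed match/mismatch score and a linear gap penalty, whereas Needle uses a substitution matrix with separate gap-open and gap-extend penalties, so the scores are not comparable with those in the table. The function names and parameter values are hypothetical.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> Tuple[int, str, str]:
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical non-gap columns as a percentage of the alignment length."""
    if not aligned_a:
        return 0.0
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


# Toy sequences, not the LIN28B data
s, top, bottom = needleman_wunsch("ACGT", "AGT")
print(s, top, bottom, percent_identity(top, bottom))
```

Because identity is reported over the full alignment length, a hit that covers only part of the query is penalised by the gap columns, which matches the low %identity seen for the partial hits in the table.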

Multiple Sequence Alignment


Multiple Sequence Alignment was performed by using the CLUSTALW algorithm in the software package MEGA X.

I. Using results from BLASTN for DNA

In the above figures we can see:

• Matches, mismatches and gaps accommodated to align the DNA sequences.

• * marks on the top of certain columns depict the columns that are conserved across all the DNA sequences.

Inferences

• For the DNA sequence MSA, conserved sites = 457
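
The conserved-site count corresponds to the number of alignment columns marked with *. A minimal sketch of that count is given below, assuming the alignment has been exported from MEGA X as equal-length strings with "-" for gaps; `conserved_columns` and the toy alignment are hypothetical.

```python
from typing import List


def conserved_columns(alignment: List[str]) -> List[int]:
    """Return the indices of columns where every sequence carries the same non-gap symbol."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        symbols = {symbol.upper() for symbol in column}
        if len(symbols) == 1 and "-" not in symbols:
            conserved.append(index)
    return conserved


# Toy alignment of three sequences, not the LIN28B data
toy_alignment = ["ATG-CA", "ATGACA", "ATCACA"]
sites = conserved_columns(toy_alignment)
print(f"conserved sites = {len(sites)} of {len(toy_alignment[0])} columns")
```

The same function applies unchanged to a protein alignment, since it only compares the symbols in each column.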


II. Using results from BLASTP for Proteins

MSA obtained:

In the above figure we can see:

• Matches, mismatches and gaps accommodated to align the protein sequences.

• * marks on the top of certain columns depict the columns that are conserved across all the protein sequences.

Inferences –

For the protein sequence MSA, conserved sites = 307 out of 349 aa (in most species); hence we can say this protein
is highly conserved across various animals.
