Assignment No 1 Part 2
Host-pathogen interactions are important for understanding the mechanism of infection and for
developing better treatment and prevention of infectious diseases. A protein interaction map can
guide research on the key protein-protein interactions (PPIs) that allow pathogens to adhere to,
colonize, and even invade human cells. Host-pathogen PPI prediction, however, has its challenges.
Dataset description:
We used the positive PPI files for Yersinia pestis and Bacillus anthracis from the PHISTO
database and then matched the proteins to their corresponding sequences from NCBI and UniProt
in order to predict protein interactions.
We started with 4040 interactions for Yersinia pestis. Many of these interactions were dropped
because the corresponding records have been deleted from databases such as NCBI and UniProt;
data that cannot contribute anything to the results are considered noise.
Furthermore, interactions involving sequences that contain uncommon amino acids were excluded
from our datasets, because amino acids that occur too rarely are also considered data that
cannot contribute much to the results.
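The two filtering steps above can be sketched as follows. This is a minimal illustration, not the code used for the assignment: the function names, the `(pathogen_id, host_id)` pair layout, and the `sequences` dictionary are all assumptions made for the example, and "uncommon amino acids" is taken to mean any residue outside the 20 standard one-letter codes.

```python
# The 20 standard amino acids, as one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def has_only_standard_residues(sequence):
    """Return True if the sequence is non-empty and uses only the 20 standard codes."""
    sequence = sequence.strip().upper()
    return bool(sequence) and set(sequence) <= STANDARD_AMINO_ACIDS


def filter_interactions(pairs, sequences):
    """Keep pairs whose two proteins both have a retrievable, standard-only sequence.

    pairs: iterable of (pathogen_id, host_id) tuples (hypothetical layout).
    sequences: dict mapping protein ID to sequence string (hypothetical layout).
    """
    kept = []
    for pathogen_id, host_id in pairs:
        pathogen_seq = sequences.get(pathogen_id)
        host_seq = sequences.get(host_id)
        # A missing sequence stands in for a record deleted from NCBI or UniProt.
        if pathogen_seq is None or host_seq is None:
            continue
        if has_only_standard_residues(pathogen_seq) and has_only_standard_residues(host_seq):
            kept.append((pathogen_id, host_id))
    return kept
```

For example, a pair is dropped if either protein has no sequence in `sequences` or if either sequence contains a code such as `X` or `U`.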
Finally, the PPIs are encoded using pseudo amino acid composition (PseAAC), a representation
of protein samples that was introduced to improve protein subcellular localization prediction
and membrane protein type prediction.
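To make the encoding step concrete, here is a simplified sketch of a PseAAC-style feature vector. It is an illustration under stated assumptions, not the encoder used in this assignment: it uses a single physicochemical property (the Kyte-Doolittle hydropathy scale) where the original PseAAC formulation averages several properties, the parameters `lam` and `weight` are illustrative defaults, and representing a pair by concatenating the two protein vectors is also an assumption.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (the single property used in this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _standardize(scale):
    """Rescale a property to zero mean and unit standard deviation over the 20 amino acids."""
    values = [scale[a] for a in AMINO_ACIDS]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {a: (scale[a] - mean) / std for a in AMINO_ACIDS}


HYDROPATHY = _standardize(KYTE_DOOLITTLE)


def pseaac(sequence, lam=3, weight=0.05):
    """Return a (20 + lam)-dimensional PseAAC-style vector for one protein.

    Assumes the sequence contains only the 20 standard one-letter codes;
    any other character raises KeyError.
    """
    n = len(sequence)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    # Conventional amino acid composition: 20 normalized frequencies.
    freqs = [sequence.count(a) / n for a in AMINO_ACIDS]
    # Sequence-order correlation factors for gaps k = 1..lam.
    thetas = []
    for k in range(1, lam + 1):
        total = sum(
            (HYDROPATHY[sequence[i]] - HYDROPATHY[sequence[i + k]]) ** 2
            for i in range(n - k)
        )
        thetas.append(total / (n - k))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


def encode_pair(pathogen_seq, host_seq, lam=3, weight=0.05):
    """Represent a protein pair by concatenating the two PseAAC vectors (an assumption)."""
    return pseaac(pathogen_seq, lam, weight) + pseaac(host_seq, lam, weight)
```

The first 20 components capture composition and the last `lam` components capture sequence-order information, which plain amino acid composition discards. Each protein vector sums to 1 by construction.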
The same method is applied to the Bacillus anthracis dataset, whose interaction files were also
obtained from the PHISTO database. The number of positive interactions is 3003.
We construct the negative data by randomly selecting protein pairs from all possible protein
pairs, excluding the known interactions, and we label these pairs as negative.
We take positive and negative interactions in equal numbers, but the size may change after the
experimental results. In a previous study, the author selected negative data at
positive-to-negative ratios of 1:1, 1:2, and 1:3 and found only a very minor effect from
changing the relative sizes of the positive and negative data.
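The random negative sampling described above can be sketched as below. This is a minimal illustration under assumptions: the function name, the `ratio` and `seed` parameters, and the choice to draw candidates only from proteins that already appear in the positive set are all hypothetical, since the text does not specify the candidate pool.

```python
import random


def sample_negative_pairs(positive_pairs, ratio=1, seed=0):
    """Draw ratio * len(positives) random pairs that are not known interactions.

    positive_pairs: iterable of (pathogen_id, host_id) tuples (hypothetical layout).
    ratio: number of negatives per positive, e.g. 1, 2, or 3.
    """
    positives = set(positive_pairs)
    # Assumption: candidates come from proteins seen in the positive set.
    pathogen_ids = sorted({p for p, _ in positives})
    host_ids = sorted({h for _, h in positives})
    target = ratio * len(positives)
    max_negatives = len(pathogen_ids) * len(host_ids) - len(positives)
    if target > max_negatives:
        raise ValueError("not enough non-interacting pairs for this ratio")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    negatives = set()
    while len(negatives) < target:
        pair = (rng.choice(pathogen_ids), rng.choice(host_ids))
        # Reject known interactions; the set also removes duplicate draws.
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)
```

Calling the function with `ratio=1`, `ratio=2`, and `ratio=3` gives the 1:1, 1:2, and 1:3 settings. Pairs sampled this way are only assumed to be non-interacting; some may be true interactions that have not yet been reported.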
The format of the list is: amino acid name - 3 letter code - 1 letter code.
alanine - ala - A
arginine - arg - R
asparagine - asn - N
aspartic acid - asp - D
cysteine - cys - C
glutamine - gln - Q
glutamic acid - glu - E
glycine - gly - G
histidine - his - H
isoleucine - ile - I
leucine - leu - L
lysine - lys - K
methionine - met - M
phenylalanine - phe - F
proline - pro - P
serine - ser - S
threonine - thr - T
tryptophan - trp - W
tyrosine - tyr - Y
valine - val - V
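import pandas as pd

# Load the CSV file from the Colab working directory into a DataFrame.
dataset = pd.read_csv("/content/golf-dataset1.csv")
# Preview the first five rows to check that the file was parsed as expected.
dataset.head()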