
Computational Analysis of Protein Interactions between Pathogens and Their Hosts


Introduction:

Understanding host-pathogen interactions is important for uncovering the mechanism of infection
and for developing better treatment and prevention of infectious diseases. A protein interaction
map can guide research toward the key protein-protein interactions (PPIs) that allow pathogens to
adhere to, colonize, and even invade human cells. Host-pathogen PPI prediction, however, has its
challenges.

Dataset description:

We used the positive PPI interaction files for Yersinia pestis and Bacillus anthracis from the
PHISTO database and then matched the corresponding sequences from NCBI and UniProt in order to
predict protein interactions.
We take 4040 interactions of Yersinia pestis; many of the interactions are dropped because their
records have been deleted from databases such as NCBI and UniProt, and data that cannot
contribute anything to the results are considered noise.
Furthermore, interactions whose sequences contain uncommon amino acids are excluded from our
datasets, because amino acids that occur too rarely are likewise considered data that cannot
contribute much to the results.
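The two filtering steps described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the function names, the ID strings, and the idea of holding sequences in a dictionary keyed by ID are all assumptions, and "uncommon amino acids" is interpreted here as any letter outside the 20 standard one-letter codes.

```python
# The 20 standard amino acids, by one-letter code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def has_only_standard_residues(sequence):
    """Return True if a non-empty sequence uses only the 20 standard letters."""
    seq = sequence.strip().upper()
    return bool(seq) and set(seq) <= STANDARD_AA


def filter_pairs(pairs, sequences):
    """Keep (pathogen_id, host_id) pairs whose two sequences are both usable.

    `pairs` is a list of ID tuples and `sequences` maps ID -> sequence string
    (both hypothetical structures). A pair is dropped if either record is
    missing or if either sequence contains a non-standard residue.
    """
    kept = []
    for pathogen_id, host_id in pairs:
        p_seq = sequences.get(pathogen_id)
        h_seq = sequences.get(host_id)
        if p_seq is None or h_seq is None:
            continue  # record no longer available in the sequence database
        if has_only_standard_residues(p_seq) and has_only_standard_residues(h_seq):
            kept.append((pathogen_id, host_id))
    return kept
```

For example, calling `filter_pairs` with a pair whose host sequence contains the letter X or U would drop that pair, while pairs with complete, standard sequences pass through unchanged.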
Finally, the PPIs are encoded as pseudo amino acid composition (PseAAC), a representation of
protein samples that was introduced to improve protein subcellular localization prediction and
membrane protein type prediction.
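To make the encoding concrete, here is a simplified sketch of a PseAAC-style feature vector. It is not the project's implementation: it uses a single physicochemical property (the Kyte-Doolittle hydropathy scale) in place of the several properties used in the original PseAAC formulation, and the function name `pseaac` and the defaults `lam=4` and `weight=0.05` are illustrative assumptions. The output has 20 composition components followed by `lam` sequence-order correlation components.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here as one illustrative property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    values = list(scale.values())
    mu, sigma = mean(values), pstdev(values)
    return {aa: (v - mu) / sigma for aa, v in scale.items()}


def pseaac(sequence, lam=4, weight=0.05, scale=HYDROPATHY):
    """Return 20 + lam features for a sequence of standard residues."""
    seq = sequence.upper()
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    prop = standardize(scale)

    # Conventional amino acid composition: occurrence frequency per residue.
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

    # Sequence-order correlation factors for gaps k = 1 .. lam.
    thetas = []
    for k in range(1, lam + 1):
        diffs = [(prop[seq[i + k]] - prop[seq[i]]) ** 2
                 for i in range(len(seq) - k)]
        thetas.append(sum(diffs) / len(diffs))

    # Joint normalization so that all components sum to 1.
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]
```

The sketch assumes the sequence has already been filtered to the 20 standard residues; any other letter raises a `KeyError` when its property value is looked up.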
The same method is applied to the Bacillus anthracis dataset, whose interaction files are also
taken from the PHISTO database. The number of positive interactions is 3003.

Negative data preparation:

We construct negative data by randomly selecting protein pairs from all possible protein pairs
except the known interactions, and we label these pairs as negative.
We take positive and negative interactions in equal numbers, but the ratio may change after the
experimental results: in one study, the author selected negative data at positive-to-negative
ratios of 1:1, 1:2, and 1:3 and found that changing the ratio had only a very minor effect.
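The random negative sampling described above can be sketched as below. The function name, its parameters, and the fixed random seed are assumptions for illustration; `ratio` is the number of negatives per positive, so `ratio=1` gives the 1:1 setting and `ratio=2` or `ratio=3` give the other two.

```python
import itertools
import random


def sample_negative_pairs(pathogen_ids, host_ids, positive_pairs, ratio=1, seed=0):
    """Draw random (pathogen, host) pairs that are not known interactions.

    Returns ratio * len(positive_pairs) distinct pairs, each to be labelled 0.
    """
    positives = set(positive_pairs)

    # All possible pathogen-host pairs except the known positive ones.
    candidates = [
        pair
        for pair in itertools.product(sorted(set(pathogen_ids)),
                                      sorted(set(host_ids)))
        if pair not in positives
    ]

    n_wanted = ratio * len(positives)
    if n_wanted > len(candidates):
        raise ValueError("not enough non-interacting pairs for this ratio")

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidates, n_wanted)
```

Enumerating every candidate pair keeps the sketch simple; for very large protein sets, drawing pairs one at a time and rejecting known positives would use less memory.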

Our dataset contains the following 51 fields:


1. Target: 1 for positive interactions and 0 for negative interactions.
2. Ids: the pathogen and host IDs, from which we can trace the record of the specific sequence.
3. Amino acids: each is represented by its one-letter code combined with the first letter of the
corresponding organism.
For example, in bioinformatics the letter A refers to alanine, and we combine this letter with
the organism to which this particular amino acid belongs.
A-H means alanine of humans for that particular record, A-Y means alanine of Yersinia pestis,
and likewise for the other amino acids.

The format of the list is: amino acid name - 3 letter code - 1 letter code.
alanine - ala - A
arginine - arg - R
asparagine - asn - N
aspartic acid - asp - D
cysteine - cys - C
glutamine - gln - Q
glutamic acid - glu - E
glycine - gly - G
histidine - his - H
isoleucine - ile - I
leucine - leu - L
lysine - lys - K
methionine - met - M
phenylalanine - phe - F
proline - pro - P
serine - ser - S
threonine - thr - T
tryptophan - trp - W
tyrosine - tyr - Y
valine - val - V
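The column naming scheme described above (one-letter code plus an organism letter) can be sketched as follows. The helper names are hypothetical, and only the H and Y suffixes are taken from the text; the suffix is left as a parameter because the letter used for other organisms is not stated.

```python
# One-letter codes in the same order as the list above.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def column_names(host_suffix="H", pathogen_suffix="Y"):
    """Return labels such as 'A-H' (alanine, human) and 'A-Y' (alanine, Y. pestis)."""
    host_cols = [f"{aa}-{host_suffix}" for aa in AMINO_ACIDS]
    pathogen_cols = [f"{aa}-{pathogen_suffix}" for aa in AMINO_ACIDS]
    return host_cols + pathogen_cols


def label_values(values, suffix):
    """Attach an organism suffix to per-residue values, e.g. {'A': 0.1} -> {'A-H': 0.1}."""
    return {f"{aa}-{suffix}": values[aa] for aa in AMINO_ACIDS}
```

With the default suffixes, `column_names()` yields 40 labels: 20 for the host followed by 20 for the pathogen.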

Loading the dataset from a CSV file in Python:

import pandas as pd

# Load the interaction dataset from its CSV file.
dataset = pd.read_csv("/content/golf-dataset1.csv")

# Show the first five rows to check the columns.
dataset.head()
