MetaPred2CS is a meta-predictor specifically designed to predict interactions in prokaryotic two-component systems (TCS),
i.e. the pairing between histidine kinases and response regulators.
MetaPred2CS is based on a Support Vector Machine (SVM) that combines six individual sequence-based protein-protein
interaction prediction methods: two co-evolution-based methods (in-silico two-hybrid (i2H) and mirror tree (MT))
and four genomic context methods (phylogenetic profiling (PP), gene fusion (GF), gene neighbourhood (GN)
and gene operon (GO)).
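As a rough illustration of how six per-method scores could feed a single SVM decision, the sketch below assembles them into a feature vector and evaluates a linear decision function. The class `MethodScores`, the function `linear_svm_decision`, the weights and the bias are all hypothetical; the kernel and trained parameters actually used by MetaPred2CS are not given here, and a linear kernel is assumed purely for simplicity.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class MethodScores:
    """Scores from the six individual predictors for one HK-RR pair (hypothetical container)."""
    i2h: float
    mt: float
    pp: float
    gf: float
    gn: float
    go: float


def linear_svm_decision(scores, weights, bias):
    """Return the signed decision value w.x + b; a positive value means 'interacting'."""
    features = astuple(scores)
    if len(weights) != len(features):
        raise ValueError("one weight per method is required")
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical weights and bias, chosen only so that the example runs.
WEIGHTS = (0.5, 0.5, 1.0, 1.0, 1.0, 1.0)
BIAS = -2.0

pair = MethodScores(i2h=0.8, mt=0.6, pp=1.0, gf=0.0, gn=1.0, go=1.0)
decision = linear_svm_decision(pair, WEIGHTS, BIAS)
print("interacting" if decision > 0 else "non-interacting")
```

In a trained model the weights and bias would come from fitting on labelled pairs rather than being set by hand.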
All methods implemented in MetaPred2CS require a BLAST search to identify homologous proteins to the query proteins.
In the case of the i2H and MT methods, the search is performed against sequences from UniProt. In the case of the
genomic context methods, sequence searches are performed against our local reference genome database, which includes
243 genomes (see Table 1). The genomes and genome annotations were downloaded from the NCBI database,
and the operon architecture and transcription units from here.
Lastly, MetaPred2CS has been trained on single-domain histidine kinase and response regulator proteins.
In the case of hybrid two-component system proteins, users need to decompose the protein sequence
into separate domains and submit them separately.
Training & Test - The P+ and P- datasets
The P+ and P- datasets contain 113 interacting and 1134 non-interacting experimentally validated TCS pairs, respectively,
and were compiled and manually curated from the current literature. These datasets, and different sub-classes of them,
were used to train and test MetaPred2CS using a k-fold cross-validation strategy.
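The k-fold strategy mentioned above can be sketched as follows. This is a minimal, standard-library illustration that assumes stratified folds (each fold keeps roughly the same ratio of P+ to P- pairs) and k = 10; neither the value of k nor the use of stratification is stated in the text, and `stratified_k_fold` is a hypothetical helper.

```python
import random


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_indices, test_indices) for k folds, preserving class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal the members of each class round-robin so every fold gets its share.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    for test_fold in range(k):
        test = sorted(folds[test_fold])
        train = sorted(i for f, fold in enumerate(folds) if f != test_fold for i in fold)
        yield train, test


# 113 interacting (label 1) and 1134 non-interacting (label 0) pairs, as in P+ and P-.
labels = [1] * 113 + [0] * 1134
K = 10  # assumed value; the text only says "k-fold"

splits = list(stratified_k_fold(labels, K))
for train, test in splits:
    positives_in_test = sum(labels[i] for i in test)
    print(len(train), len(test), positives_in_test)
```

With a roughly 1:10 class imbalance, stratification keeps every test fold from ending up with very few interacting pairs.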
i2H and MT methods
In the second section of the submission form, users have the option of tuning the default values of the six different
parameters used by the i2H and MT methods (Figure 1).
Figure 1. Parameters of the i2H and mirror tree methods.
The i2H and MT methods require multiple sequence alignments, which are created automatically from the user's
query proteins using BLAST and ClustalW.
For BLAST search, users can adjust the
The rest of the parameters are left at their default values, such as the scoring matrix,
which is BLOSUM_62.
The Min Matches and Max Matches parameters stand for the minimum and maximum number of species common to the
multiple sequence alignments of both query proteins. These parameters are used to filter the sequences and reduce
their number. If the minimum number of common species is not reached, the job will be terminated.
In this case, users need to re-submit using a lower cut-off value for the Min Matches parameter and/or a higher
b_value for the BLAST search.
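The effect of the Min Matches and Max Matches parameters can be sketched as below. This is only an illustration under stated assumptions: alignments are represented as dictionaries keyed by species name, the hypothetical function `select_common_species` raises an error where the server would terminate the job, and the cap simply keeps the first species in alphabetical order, which is not necessarily how the server selects them.

```python
def select_common_species(alignment_a, alignment_b, min_matches, max_matches):
    """Return the species present in both alignments, capped at max_matches.

    Each alignment maps a species name to its aligned sequence.
    Raises ValueError when fewer than min_matches species are shared.
    """
    shared = sorted(set(alignment_a) & set(alignment_b))
    if len(shared) < min_matches:
        raise ValueError(
            f"only {len(shared)} common species, need at least {min_matches}"
        )
    # Assumed selection rule: keep the first max_matches species alphabetically.
    return shared[:max_matches]


# Toy alignments for a histidine kinase and a response regulator.
hk_alignment = {
    "E_coli": "MKT-L",
    "B_subtilis": "MRTAL",
    "P_aeruginosa": "MKSAL",
    "V_cholerae": "MKTAL",
}
rr_alignment = {
    "E_coli": "MNK-I",
    "B_subtilis": "MSKKI",
    "V_cholerae": "MNKRI",
    "S_aureus": "MNKKI",
}

species = select_common_species(hk_alignment, rr_alignment, min_matches=2, max_matches=10)
print(species)
```

Calling the same function with `min_matches=5` on these toy alignments raises `ValueError`, mirroring the terminated job described above.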
The final set of parameters is the correlation function and the substitution matrix.
The available options for the scoring function are:
The available options for the substitution matrix are:
Genomic context methods
In the third section of the submission form, users have the option of tuning the default parameters of the genomic context methods (Figure 2).
Figure 2. Parameters of the genomic context methods.
In the case of phylogenetic profiles, users can change the E-value cut-off of the BLASTP search used to detect
homologous proteins across the reference genomes. This parameter is very important as discussed in
The second parameter accounts for
to infer functional linkages between proteins.
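To show why the E-value cut-off matters for phylogenetic profiles, the sketch below builds binary presence/absence profiles from per-genome best-hit E-values and compares them. The helper names, the toy E-values and the similarity measure (fraction of genomes on which the two profiles agree) are assumptions made for illustration; the measure used by MetaPred2CS is not specified here.

```python
def build_profile(best_hit_evalues, genomes, evalue_cutoff):
    """Return a 0/1 list: 1 where the genome has a hit at or below the cut-off."""
    return [
        1 if best_hit_evalues.get(genome, float("inf")) <= evalue_cutoff else 0
        for genome in genomes
    ]


def profile_agreement(profile_a, profile_b):
    """Fraction of genomes in which both profiles have the same state (assumed measure)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    matches = sum(1 for a, b in zip(profile_a, profile_b) if a == b)
    return matches / len(profile_a)


genomes = ["g1", "g2", "g3", "g4"]
# Toy best-hit E-values per reference genome; a missing genome means no hit at all.
hk_hits = {"g1": 1e-30, "g2": 1e-8, "g4": 1e-3}
rr_hits = {"g1": 1e-25, "g2": 1e-12, "g3": 1e-2}

strict = profile_agreement(
    build_profile(hk_hits, genomes, 1e-5), build_profile(rr_hits, genomes, 1e-5)
)
loose = profile_agreement(
    build_profile(hk_hits, genomes, 1e-1), build_profile(rr_hits, genomes, 1e-1)
)
print(strict, loose)
```

In this toy case the strict cut-off gives identical profiles, while the loose one admits weak hits in different genomes and halves the agreement.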
Likewise, in the case of the gene fusion method, users can also change the E-value cut-off of the BLASTP search.
Moreover, users can tune the Bits cut-off on the local alignments of the query proteins, which are performed using the
Smith-Waterman algorithm to identify fusion events.
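One way a Bits cut-off could be applied when looking for fusion events is sketched below. It assumes the common "Rosetta stone" criterion: both query proteins align, with a bit score at or above the cut-off, to non-overlapping regions of the same subject protein. The `Hit` record, the function `fusion_candidates`, the toy hits and the cut-off of 50 are hypothetical, and the criterion implemented by MetaPred2CS may differ.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """A local alignment of a query protein against a subject protein (hypothetical record)."""
    subject_id: str
    subject_start: int  # 1-based, inclusive
    subject_end: int    # 1-based, inclusive
    bit_score: float


def fusion_candidates(hits_a, hits_b, bits_cutoff):
    """Return subject IDs hit by both queries in non-overlapping regions above the cut-off."""
    candidates = set()
    for a in hits_a:
        if a.bit_score < bits_cutoff:
            continue
        for b in hits_b:
            if b.bit_score < bits_cutoff or a.subject_id != b.subject_id:
                continue
            overlap = a.subject_start <= b.subject_end and b.subject_start <= a.subject_end
            if not overlap:
                candidates.add(a.subject_id)
    return sorted(candidates)


# Toy hits: both queries match different halves of "fusedX"; "protY" matches overlap.
hk_hits = [Hit("fusedX", 10, 400, 210.0), Hit("protY", 5, 300, 95.0)]
rr_hits = [Hit("fusedX", 450, 570, 120.0), Hit("protY", 100, 220, 80.0)]

fusions = fusion_candidates(hk_hits, rr_hits, bits_cutoff=50.0)
print(fusions)
```

Raising the cut-off above the weaker of the two `fusedX` scores removes the candidate, which is the trade-off the parameter controls.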
Finally, in the case of the gene neighbourhood and gene operon methods, the two tunable parameters are the E-value
cut-off of the BLASTP search and the genomic distance cut-off. The latter is important for defining neighbouring genes
and has been extensively discussed in the following works:
Salgado H. et al., 2000;
Strong M. et al., 2003;
Ermolaeva MD. et al., 2001;
Moreno-HG. et al., 2002;
Ross O. et al., 1999,
with 200 bp (the default) being the generally accepted cut-off value.
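The genomic distance cut-off can be illustrated with a small sketch. It assumes 1-based inclusive gene coordinates on the same replicon, measures the intergenic gap between the two genes, and optionally requires them to be on the same strand. The `Gene` record and both functions are hypothetical; details such as intervening genes or circular chromosomes are ignored.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """A gene on a replicon (hypothetical record; coordinates are 1-based, inclusive)."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def intergenic_distance(gene_a, gene_b):
    """Number of base pairs between two genes; 0 if they touch or overlap."""
    first, second = sorted((gene_a, gene_b), key=lambda g: g.start)
    return max(0, second.start - first.end - 1)


def are_neighbours(gene_a, gene_b, max_distance=200, same_strand=True):
    """True if the genes lie within max_distance bp (200 bp default, as in the text)."""
    if same_strand and gene_a.strand != gene_b.strand:
        return False
    return intergenic_distance(gene_a, gene_b) <= max_distance


hk = Gene("hk", 1000, 2500, "+")
rr = Gene("rr", 2550, 3200, "+")
distant = Gene("distant", 5000, 5600, "+")

print(intergenic_distance(hk, rr), are_neighbours(hk, rr))
print(intergenic_distance(rr, distant), are_neighbours(rr, distant))
```

Here the kinase and regulator are 49 bp apart and count as neighbours, while the third gene lies 1799 bp away and does not.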