Distill 2.0 (Porter, Porter+, PaleAle, BrownAle, XStout, XXStout, 3Distill): help and references
The Servers: description
Distill is a single interface to
all the servers described below. Distill_multi
is an interface to send multiple queries (up to 32Kbytes in total) in FASTA format
to any number of the servers described below.
Porter is a server for protein secondary structure prediction based on an ensemble of 45 BRNNs (bidirectional recurrent neural networks). Porter's features include:
- Efficient input coding.
In Porter the input at each residue is coded as a letter out of an alphabet of 25.
Besides the 20 standard amino acids, B (aspartic acid or asparagine), U (selenocysteine),
X (unknown), Z (glutamic acid or glutamine) and . (gap) are considered.
The input presented to the networks is the frequency of each of the 24 non-gap symbols,
plus the overall proportion of gaps in each column of the alignment.
- Output filtering and incorporation of predicted long-range information.
In Porter the first-stage predictions are filtered by a second network.
The input to this network includes the predictions of the first stage network
averaged over multiple contiguous windows, covering 225 residues.
- New, large training sets.
Porter is trained on the 25% pdb_select list of December 2003
(available here).
After processing by DSSP the set contains 2171 proteins and 344,653 amino acids.
Profiles obtained from multiple sequence alignments have been shown to significantly improve
secondary structure prediction performance. In Porter we use multiple sequence alignments
extracted from the NR database as available on March 3 2004, containing over 1.4 million
sequences. The database is redundancy reduced at a 98% threshold,
leading to a final 1.05 million sequences. The alignments are generated by three runs of
PSI-BLAST.
- Large ensembles of models
Five two-stage BRNN models are trained independently to build Porter.
Differences among models are introduced by two factors: stochastic elements in the
training protocol, such as different initial weights of the networks and different shuffling
of the examples; different architecture and number of free parameters of the models.
A copy of each of the 5 models is saved at regular intervals (100 epochs) during training.
Nine such copies of each of the 5 models (45 models in total) are ensemble averaged in Porter.
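As an illustration only, the input coding described in the first point above could be sketched as follows. This is not the code run by the server: the function names are hypothetical, and we assume that symbol frequencies are normalised over the non-gap entries of each column and that any unrecognised character is counted as X.

```python
from collections import Counter

# The 20 standard amino acids plus B, U, X and Z: 24 non-gap symbols.
SYMBOLS = "ACDEFGHIKLMNPQRSTVWYBUXZ"
GAP = "."


def normalise(ch):
    """Map a character to the 25-letter alphabet (assumption: unknowns become X)."""
    ch = ch.upper()
    if ch == GAP or ch in SYMBOLS:
        return ch
    return "X"


def encode_column(column):
    """Return 24 symbol frequencies followed by the gap proportion of one column."""
    symbols = [normalise(ch) for ch in column]
    counts = Counter(symbols)
    total = len(symbols)
    gaps = counts[GAP]
    non_gap = total - gaps
    # Assumption: frequencies are computed over the non-gap entries only.
    freqs = [counts[s] / non_gap if non_gap else 0.0 for s in SYMBOLS]
    gap_fraction = gaps / total if total else 0.0
    return freqs + [gap_fraction]


def encode_alignment(rows):
    """Encode an alignment given as equal-length strings, one per sequence."""
    if not rows:
        return []
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all alignment rows must have the same length")
    return [encode_column([r[i] for r in rows]) for i in range(length)]
```

Each residue position is then described by 25 numbers: 24 frequencies and one gap proportion.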
Porter, tested by a rigorous 5-fold cross validation procedure,
achieves 79% correct classification on the "hard" CASP
3-class assignment.
A paper describing Porter has been published in the journal Bioinformatics (toll-free link).
Note that, when available, homology information is now provided to Porter as a further input. This results in substantially improved predictions.
A description of the algorithms we use to incorporate homology can be found here (BMC Bioinformatics).
When no satisfactory templates can be found by PSI-BLAST we try to find remote homologues by our own fold recognition software, which we describe in a paper in the journal Proteins.
Porter+ is a server for
the prediction of a new alphabet of local structural motifs. The
motifs are built by applying multidimensional scaling (MDS)
and clustering to pair-wise angular distances for multiple Φ and Ψ
dihedral angle values collected from high-resolution protein structures
(Sims et al. 2005).
This principled method allowed the visualisation of protein backbone fragments
in a reduced 3D conformational space, and led to the identification of a small number of conformational clusters
that are populated by real backbones. In Porter+ we map these clusters into a conformational alphabet of 14 letters,
representing structural motifs for tetra-peptides.
Porter+'s architecture is similar to Porter's.
It classifies approximately 60% of residues into the correct structural motif, roughly 30% above
a base-line statistical predictor.
A paper describing Porter+ has been published in the Journal of Computational Biology.
Note that, when available, homology information is now provided to Porter+ as a further input. This results in substantially improved predictions.
A description of the algorithms we use to incorporate homology can be found here (BMC Bioinformatics).
When no satisfactory templates can be found by PSI-BLAST we try to find remote homologues by our own fold recognition software, which we describe in a paper in the journal Proteins.
Below is a table describing the sequences of Φ and Ψ angles and corresponding 1-letter code for structural motifs.
PaleAle is a server for the prediction of protein relative solvent accessibility.
Each amino acid is classified as being in one of 4 (approximately equally frequent) classes:
- B=completely buried (0-4% exposed)
- b=partly buried (4-25% exposed)
- e=partly exposed (25-50% exposed)
- E=completely exposed (50+% exposed)
The architecture of PaleAle's classifier is an exact copy
of Porter's (described above). PaleAle's accuracy, measured on the same large, non-redundant set adopted to train Porter (described above) exceeds 55% correct 4-class classification, and 80% 2-class classification (Buried vs Exposed, with 25% threshold).
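The four classes and the 2-class grouping above can be written down as a small mapping. This is a minimal sketch with hypothetical function names; we assume that each class includes its lower bound and excludes its upper bound, which the class definitions above do not state explicitly.

```python
def rsa_class(percent_exposed):
    """Map relative solvent accessibility (in %) to the 4-letter alphabet B, b, e, E."""
    if percent_exposed < 0:
        raise ValueError("accessibility cannot be negative")
    # Assumption: intervals are [0, 4), [4, 25), [25, 50) and [50, inf).
    if percent_exposed < 4:
        return "B"
    if percent_exposed < 25:
        return "b"
    if percent_exposed < 50:
        return "e"
    return "E"


def to_two_class(label):
    """Collapse the 4 classes at the 25% threshold: B/b are buried, e/E are exposed."""
    if label not in ("B", "b", "e", "E"):
        raise ValueError("unknown class label: %r" % label)
    return "buried" if label in ("B", "b") else "exposed"
```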
Note that, when available, homology information is now provided to PaleAle as a further input. This results in substantially improved predictions.
A description of the algorithms we use to incorporate homology can be found here (BMC Bioinformatics).
When no satisfactory templates can be found by PSI-BLAST we try to find remote homologues by our own fold recognition software, which we describe in a paper in the journal Proteins.
BrownAle is a server for the prediction of protein Contact Density. We define Contact Density as the Principal Eigenvector (PE) of a protein's residue contact map at 8Å, multiplied by the principal eigenvalue. The PE is a sequential encoding of the notion of contact among residues, which holds most of the information contained in the contact map, and hence provides a compact but highly informative representation of a protein's structure.
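To make the definition concrete, Contact Density can be computed from known coordinates roughly as sketched below. This is an illustration under stated assumptions, not the code used to build our training data: we assume one 3D point per residue (for instance the C-α), a strict "closer than 8Å" test, and no self-contacts on the diagonal. The principal eigenvector is found by plain power iteration on the contact map plus the identity, a shift that leaves the eigenvectors unchanged but avoids oscillation on chain-like maps. All names are hypothetical.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact map: 1.0 if two distinct residues are closer than threshold."""
    n = len(coords)
    return [[1.0 if i != j and math.dist(coords[i], coords[j]) < threshold else 0.0
             for j in range(n)]
            for i in range(n)]


def contact_density(cmap, iterations=1000, tol=1e-12):
    """Principal eigenvector of cmap multiplied by the principal eigenvalue."""
    n = len(cmap)
    if n == 0:
        return []
    v = [1.0 / math.sqrt(n)] * n  # positive start keeps all components non-negative
    for _ in range(iterations):
        # One power-iteration step with (A + I).
        w = [v[i] + sum(cmap[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        delta = max(abs(a - b) for a, b in zip(w, v))
        v = w
        if delta < tol:
            break
    # Rayleigh quotient of the unshifted map gives the principal eigenvalue.
    av = [sum(cmap[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(av, v))
    return [eigenvalue * x for x in v]
```

Residues with many contacts to other well-connected residues receive the largest values.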
Contact Density is useful for the ab initio prediction of protein structures for many reasons:
- algorithms exist to reconstruct the full contact maps from the PE for short proteins (e.g. see Porto et al. 2004), and correct contact maps lead to correct 3D structures;
- Contact Density may be used directly, in combination with other constraints, to guide the search for optimal 3D configurations;
- Contact Density may be adopted as an extra input feature to systems for the direct prediction of contact maps, as in the XXStout server described below;
- predicted PE may be used to identify protein domains.
BrownAle predicts Contact Density in 4 classes. The class thresholds are assigned so that the classes are approximately equally numerous, and correspond to very low, medium-low, medium-high and very high Contact Density.
BrownAle's architecture is an exact copy
of Porter's (described above). The accuracy of BrownAle, measured on the same large, non-redundant set adopted to train Porter (described above) is 46.5% for the 4-class problem, and nearly 73% if the 4 classes are mapped into 2 (dense vs. non dense).
A paper describing BrownAle's and XXStout's methods has been published in the journal BMC Bioinformatics.
Note that, when available, homology information is now provided to BrownAle as a further input. This results in substantially improved predictions.
A description of the algorithms we use to incorporate homology can be found here (BMC Bioinformatics).
When no satisfactory templates can be found by PSI-BLAST we try to find remote homologues by our own fold recognition software, which we describe in a paper in the journal Proteins.
Shandy is a server for the prediction of protein domain boundaries. Boundary definitions are extracted from the SCOP database.
We use predicted secondary structure, solvent accessibility, contact density and structural motifs as an input to the predictor, alongside the sequence and multiple alignments. We also search for PDB and SCOP templates, which, when available, are passed to the predictor as a further input, greatly enhancing the quality of its predictions. When no satisfactory templates can be found by PSI-BLAST we try to find remote homologues by our own fold recognition software, which we describe in a paper in the journal Proteins.
A paper describing Shandy's methods has been published in the journal BMC Bioinformatics.
XStout is a server for the prediction of coarse protein topologies.
A protein is represented by a set of rigid rods associated with its secondary structure elements
(α-helices and β-strands, as predicted by Porter).
First, we employ cascades of recursive neural networks derived from graphical models to predict the
relative placements of segments. These are represented as distance maps discretised into 4 classes.
The discretisation levels ((0Å,10Å),(10Å,18Å),(18Å,29Å),(29Å,∞Å))
are statistically inferred from a large and curated data set.
Coarse 3D folds of proteins are then assembled starting from topological information
predicted in the first stage. Reconstruction is carried out by minimising a cost function taking the
form of a purely geometrical potential.
The reconstruction procedure is fast and often leads to topologically correct coarse structures,
that could be exploited as a starting point for various protein modelling strategies.
Both coarse distance maps and a number of coarse reconstructions are produced by XStout.
A paper on XStout's methods has been published in the Journal of Computational Biology.
Notice that XStout is not integrated into Distill 2.0 anymore. The links in this section will direct you to the server that implements Distill 1.0.
XXStout is a server for the prediction of protein residue contact maps. Two residues are considered in contact if their C-αs are closer than a given threshold. XXStout predicts contacts at three different thresholds: 6Å, 8Å and 12Å.
The contact maps are predicted as follows: protein secondary structure, solvent accessibility and contact density are predicted from the sequence using, respectively, Porter, PaleAle and BrownAle; ensembles of two-dimensional Recursive Neural Networks then predict the contact maps based on the sequence, a 2-dimensional profile of amino-acid frequencies obtained from a PSI-BLAST alignment of the sequence against the NR database, and the predicted secondary structure, solvent accessibility and contact density. The introduction of contact density as an intermediate representation, which is novel in XXStout, significantly improves the performance of the system.
XXStout is trained on the large, non-redundant set adopted to train Porter (described above), after the exclusion of all the proteins longer than 200 residues. The tables below summarise the performance of XXStout on a test set composed of 1/5 of this set (327 proteins). Performance is given for the protein_length/5 and protein_length/2 contacts with the highest probability, for sequence separations of at least 6, at least 12, and at least 24, in CASP style.
Note that, when available, homology information is now provided to XXStout as a further input. This results in substantially improved predictions.
A description of the methods to incorporate homology information can be found in this paper in the journal BMC Structural Biology.
When no satisfactory templates can be found by PSI-BLAST we try to find remote homologues by our own fold recognition software, which we describe in a paper in the journal Proteins.
Top protein_length/5 contacts. Performances as: precision%(recall%)
Threshold | separation ≥ 6 | separation ≥ 12 | separation ≥ 24 |
8Å | 46.4%(5.9%) | 35.4%(5.7%) | 19.8%(4.6%) |
12Å | 89.9%(2.3%) | 62.5%(2.0%) | 49.9%(2.2%) |
Top protein_length/2 contacts. Performances as: precision%(recall%)
Threshold | separation ≥ 6 | separation ≥ 12 | separation ≥ 24 |
8Å | 36.6%(11.8%) | 27.0%(11.0%) | 15.7%(9.3%) |
12Å | 85.5%(5.5%) | 55.6%(4.6%) | 43.8%(4.9%) |
A paper describing BrownAle's and XXStout's methods has been published in the journal BMC Bioinformatics.
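For readers who wish to score a predicted map in the same style as the tables above, the following is a minimal sketch of a precision and recall computation over the most probable contacts. It is not our evaluation script: the function and argument names are hypothetical, `true_map` stands for a 0/1 contact map derived from a known structure, and we assume that protein_length/5 is rounded down with at least one contact selected.

```python
def top_contacts_score(prob_map, true_map, fraction=5, min_separation=6):
    """Precision and recall of the length/fraction most probable contacts.

    Only residue pairs (i, j) with j - i >= min_separation are considered.
    """
    length = len(prob_map)
    pairs = [(prob_map[i][j], i, j)
             for i in range(length)
             for j in range(i + min_separation, length)]
    # Rank candidate pairs by predicted probability, highest first.
    pairs.sort(key=lambda p: p[0], reverse=True)
    # Assumption: integer division, but never fewer than one selected pair.
    selected = pairs[: max(1, length // fraction)]
    hits = sum(1 for _, i, j in selected if true_map[i][j])
    actual = sum(1 for _, i, j in pairs if true_map[i][j])
    precision = hits / len(selected) if selected else 0.0
    recall = hits / actual if actual else 0.0
    return precision, recall
```

Calling the function with `fraction=2` and `min_separation=12` or `24` gives the other settings reported above.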
3Distill
3Distill is a server for the prediction of protein structures. We accept queries of up to 250 amino acids. 3Distill relies on a fast optimisation algorithm guided by a potential based on secondary structure predicted by Porter, solvent accessibility predicted by PaleAle, contact density predicted by BrownAle, and residue contact maps predicted by XXStout.
A preliminary implementation of 3Distill took part in CASP6 (group Distill, CASP ID 0348): its model 1 was ranked in the top 20 predictors out of 181 for GDT_TS on Novel Fold hard targets, and for Z-score for all Novel Fold and Near Novel Fold targets.
Note that, when available, homology information is now provided to 3Distill as a further input. This results in substantially improved predictions for more than half of the queries we receive.
When no satisfactory templates can be found by PSI-BLAST we try to find remote homologues by our own fold recognition software, which we describe in a paper in the journal Proteins.
Also note that now we return full-atom models.
SCLpred
SCLpred is a specialised server for the ab initio prediction of protein subcellular localisation in eukaryotes. The server has three components, trained on proteins from: Animals; Plants; Fungi. The subcellular localisation classes we predict are 4 for Animals and Fungi (Cytoplasm; Mitochondrion; Nucleus; Secretory) and 5 for plants (the same 4 as for Animals and Fungi, plus Chloroplast).
The server is based on a new neural network we have developed, and, in our tests, achieves state-of-the-art results, with correct classification rates of approximately 68-71% for plants, 67-75% for fungi and 77-82% for animals.
A paper describing SCLpred has been published in the journal Bioinformatics (toll-free link).
SCL-Epred
SCL-Epred is a generalised server for the ab initio prediction of protein subcellular localisation in eukaryotes. The server has a single component, trained on a set of 15,202 proteins from 723 different eukaryotic species. SCL-Epred predicts proteins in 3 classes: Secreted; Membrane; Other (intracellular). In our tests SCL-Epred is the state of the art in its category, with 86% correct prediction and a Generalised Correlation of 75% when tested in 10-fold cross-validation.
A paper describing SCL-Epred is currently submitted.
Input formats
Email
Your email address, the place where the prediction will be delivered.
NOTE: Check that you typed your address correctly. Many of
the queries we handle never receive an answer because the address was mistyped.
Query name
An optional name for your query. We strongly suggest that you use one.
The order in which you send your queries
may not correspond to the order in which you receive the answers.
When using Distill_multi no query name
is requested.
Predictions
If you use Distill or
Distill_multi
you can select which predictions you want to receive: Porter,
PaleAle,
BrownAle,
XStout,
XXStout,
SCLpred.
The default is all selected.
If multiple servers are selected, all the predictions for each protein sequence
will be collated into a single email. Distill_multi
will send one separate email for each protein sequence.
Input sequence(s)
The sequence of amino acids:
- You can submit bare sequences or sequences in FASTA format. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line must begin with a greater-than (">") symbol in the first column. The servers that process individual queries will ignore the description line, while Distill_multi will use it as query name.
- Porter, PaleAle, BrownAle and XStout will handle only single sequences: if you send multiple sequences through their interface, they will be concatenated and treated as a single one. If you want to submit multiple sequences to any of the servers, please use Distill_multi.
- Spaces, newlines and tabs will be ignored, so feel free to have them in your query.
- Characters not corresponding to any amino acid will be treated as X.
- Only the 1-letter amino acid code is understood. Please do not send nucleotide sequences. If you do, A will be treated as Alanine, C as Cysteine, etc.
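The rules above can be summarised by the following sketch of how a single query might be cleaned before prediction. It only illustrates the stated behaviour and is not the server's own code: the function name is hypothetical, and we assume that lower-case letters are accepted and that the valid symbols are the 20 standard amino acids plus B, U, X and Z.

```python
# Assumption: the accepted 1-letter codes are the 24 non-gap symbols used by Porter.
VALID = set("ACDEFGHIKLMNPQRSTVWYBUXZ")


def clean_query(text):
    """Return (name, sequence) from a bare or FASTA-formatted single query."""
    name = None
    residues = []
    for line in text.splitlines():
        if line.startswith(">"):
            # The first description line names the query; sequence data after
            # any later description line is simply concatenated.
            if name is None:
                name = line[1:].strip()
            continue
        for ch in line:
            if ch in " \t":
                continue  # spaces and tabs are ignored
            ch = ch.upper()
            # Anything that is not a known amino acid code becomes X.
            residues.append(ch if ch in VALID else "X")
    return name, "".join(residues)
```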
Output format
Replies are sent by email.
Porter, Porter+,
PaleAle and BrownAle's replies come as text, attached to the email, and so do SCLpred predictions.
Shandy's predictions also come as text, attached to the email, and are currently not integrated with the other servers - that is, if you submit a Shandy query you will only receive Shandy results.
You might have to "view attachments inline" in your web browser to see these replies.
If you submit multiple sequences through Distill_multi
you will receive one separate email for each sequence.
Here is an example of a prediction:
Subcellular_Localisation:
ANIMAL: NUCLEUS
Confidence: high
PLANT: NUCLEUS
Confidence: high
FUNGI: NUCLEUS
Confidence: low
EUKARYOTES: OTHER (not SECRETED, not MEMBRANE)
Confidence: medium
Prediction:
VEPAVIAGIISRESHAGKVLKNGWGDRGNGFGLMQVDKRSHKPQGTWNGEVHITQGTTIL
CCHHHHHHHHHHCCCCCCCEECCCCCCCCCCEEEEECCCCCCCCCCCCHHHHHHHHHHHH
bhHHHHHHHHgIihHHBEEebsBbHHHHBEEEEEebhHgtBEEEEebhHHHHHHHHHHHH
beeBBBbBBBbbbbEBeEeBeEbbbEeBbBbBBBBBbEEEbEEEbEbebEebBebBBebB
ccccCcCCCCcccccccccCCCCCcccCcCCCCCCCcnNnnNNNNNncCcncCnnccnnc
The first lines report Subcellular Localisation predictions by SCLpred and SCL-Epred, with estimated confidence. Only predictions for the kingdoms you have ticked will be presented. SCL-Epred predictions will be presented only if selected.
The strings following the "Prediction:" line have the following meaning:
- Line 1: The 1-letter code of your protein primary sequence. This line is always present.
- Line 2: Secondary structure prediction by Porter:
- H = helix : DSSP's H (alpha helix) + G (3-10 helix) + I (pi-helix) classes.
- E = strand : DSSP's E (extended strand) + B (beta-bridge) classes.
- C = the rest : DSSP's T (turn) + S (bend) + . (the rest).
This line is always present.
- Line 3: Structural motifs by Porter+. This line is present if you asked for a Porter+ prediction. In the table below are the 1-letter codes for the 14 structural classes, and the ideal sequence of 4 pairs of Φ and Ψ defining them:
Class | Φ1 | Ψ1 | Φ2 | Ψ2 | Φ3 | Ψ3 | Φ4 | Ψ4 |
b | 265 | 148 | 280 | 153 | 300 | 327 | 291 | 332 |
h | 274 | 152 | 301 | 321 | 297 | 322 | 293 | 322 |
H | 297 | 319 | 297 | 319 | 296 | 319 | 294 | 321 |
I | 294 | 334 | 270 | 346 | 279 | 138 | 293 | 336 |
C | 271 | 355 | 273 | 144 | 283 | 155 | 296 | 329 |
e | 253 | 144 | 254 | 144 | 279 | 149 | 299 | 333 |
E | 251 | 143 | 244 | 143 | 245 | 144 | 253 | 142 |
S | 255 | 147 | 284 | 138 | 267 | 341 | 231 | 154 |
t | 270 | 147 | 301 | 345 | 266 | 360 | 268 | 145 |
g | 294 | 327 | 283 | 346 | 250 | 1.7 | 268 | 147 |
T | 290 | 344 | 263 | 1 | 266 | 147 | 263 | 143 |
B | 292 | 349 | 248 | 148 | 252 | 144 | 254 | 145 |
s | 288 | 139 | 319 | 347 | 231 | 150 | 245 | 146 |
i | 262 | 343 | 234 | 156 | 288 | 326 | 295 | 324 |
- Line 4: Relative Solvent Accessibility prediction by PaleAle:
- B=completely buried (0-4% exposed)
- b=partly buried (4-25% exposed)
- e=partly exposed (25-50% exposed)
- E=completely exposed (50+% exposed)
This line is present if you requested a PaleAle prediction.
- Line 5: Contact Density predictions by BrownAle.
Each letter in the sequence represents a contact density class (defined as the component of the contact map's principal eigenvector
multiplied by the principal eigenvalue):
- N=very low contact density (0,0.04);
- n=medium-low contact density (0.04,0.18);
- c=medium-high contact density (0.18,0.54);
- C=very high contact density (0.54,∞)
This line is present if you requested a BrownAle prediction.
- Shandy's output: Domain boundary predictions by Shandy come as a separate email if you have submitted a query through the Shandy interface. The predictions report a string for the query sequence, and a second string giving the domain number for intra-domain residues and a dash ('-') for those residues that are predicted to be in a domain boundary.
- XStout's outputs come as 6 attachments.
- Attachment 1 (number.xstout4c) : the distance map. If the protein has N segments (as predicted by Porter. NOTE: Only Helices, and Strands of length at least 2 are considered), these will be listed at the beginning of the file, with their start and end position, e.g.:
# Segments identified
#seg start end type
#0 3 16 Helix
#1 22 26 Strand
#2 32 41 Strand
#3 45 61 Helix
In the remainder of the file the distance map prediction is provided, with one pair of segments on each line, followed by the interval, in Ångström, in which their distance is predicted to fall, and by a confidence index between 0 (no confidence) and 1 (maximal confidence), e.g.:
# Contact map prediction
#seg1 seg2 dmin dmax confidence
0 1 10 18 0.06
0 2 10 18 0.05
0 3 29 infty 0.04
1 2 0 10 0.75
1 3 18 29 0.03
2 3 10 18 0.24
Distance classes predicted are (0Å,10Å), (10Å,18Å), (18Å,29Å) and (29Å,∞Å).
- Attachments 2-6 (number.x.topo.pdb, with x=1..5) : 5 coarse reconstructions, in PDB format. NOTE: points represented are the termini of all Helices and Strands of length 2 or greater - there will be 2N such points in a protein with N segments.
- XXStout's outputs come as attachments:
- number.xxstout06: residue map at 6Å
- number.xxstout08: residue map at 8Å
- number.xxstout12: residue map at 12Å
All maps are text files of space-separated probabilities. The i-th number on the j-th row (and the j-th number on the i-th row - they are the same) represents the estimated probability that residues in position i and j are in contact.
If you ask for png images of the maps, you will receive 2Lx2L pixel greyscale images (L=protein length) where pixels in positions (2i,2j),(2i+1,2j),(2i,2j+1) and (2i+1,2j+1) represent the estimated probability of contact between residue i and j (with probability=1:black and probability=0:white). Examples of predicted maps' images are represented below.
- 3Distill's outputs come as attachments in PDB format. 5 models are returned, each one containing all atoms in the protein. When the query is longer than 250 residues, fold predictions by XStout are returned instead of full-atom models by 3Distill. Please be aware that these are much less accurate, and do not incorporate homology information to known structures.
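If you receive XStout distance maps, the first attachment can be read with a few lines of code. The sketch below is based only on the example file layout shown above for Attachment 1, so it may need adjusting for real files; the function name is hypothetical.

```python
import math


def parse_xstout4c(text):
    """Parse segments and predicted distance intervals from an .xstout4c-style file."""
    segments = {}
    distances = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if line.startswith("#"):
            # Segment rows look like "#0 3 16 Helix"; other "#" lines are headers.
            tag = fields[0][1:]
            if tag.isdigit() and len(fields) == 4:
                segments[int(tag)] = (int(fields[1]), int(fields[2]), fields[3])
            continue
        # Data rows: seg1 seg2 dmin dmax confidence, with "infty" as open upper bound.
        seg1, seg2, dmin, dmax, conf = fields
        upper = math.inf if dmax == "infty" else float(dmax)
        distances.append((int(seg1), int(seg2), float(dmin), upper, float(conf)))
    return segments, distances
```

The function returns a dictionary of segments keyed by segment number, and a list of (seg1, seg2, dmin, dmax, confidence) tuples with distances in Ångström.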
PaleAle and BrownAle's predictions always come with secondary structure predictions by Porter.
XStout's predictions come with secondary structure by Porter and solvent accessibility by PaleAle.
XXStout's predictions come with Porter's, PaleAle's and BrownAle's predictions.
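Since the per-residue strings following the "Prediction:" line are aligned position by position, they can be combined into one record per residue. The following is a minimal sketch with hypothetical names; because some lines are only present when the corresponding prediction was requested, the caller must state which tracks the lines contain.

```python
def align_tracks(lines, names=("sequence", "porter", "porter_plus", "paleale", "brownale")):
    """Pair each residue with the labels found on the lines after 'Prediction:'."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    if len(lines) != len(names):
        raise ValueError("one track name is needed per prediction line")
    if len({len(ln) for ln in lines}) != 1:
        raise ValueError("all prediction lines must have the same length")
    # zip(*lines) walks the strings column by column, i.e. residue by residue.
    return [dict(zip(names, column)) for column in zip(*lines)]
```

For a reply containing only Porter and PaleAle predictions, for example, one would pass `names=("sequence", "porter", "paleale")`.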
References
Distill as a whole
D. Baú, A. J. M. Martin, C. Mooney, A. Vullo, I. Walsh, G. Pollastri.
"Distill: A suite of web servers for the prediction of one-, two- and three-dimensional structural features of proteins"
BMC Bioinformatics, 7:402, 2006.
Open access abstract and PDF (BMC Bioinformatics web site).
Porter (secondary structure)
G. Pollastri, A. McLysaght.
"Porter: a new, accurate server for protein secondary structure prediction".
Bioinformatics, 21(8),1719-20, 2005.
Toll-free link to the article
SCLpred (subcellular localisation for Animals,Fungi,Plants)
C. Mooney, Y. Wang, G. Pollastri.
"SCLpred: Protein Subcellular Localization Prediction by N-to-1 Neural Networks", Bioinformatics, 27 (20), 2812-2819, 2011.
Abstract and PDF (Bioinformatics web site)
Porter, PaleAle (solvent accessibility), and how we use homology
C. Mooney, G. Pollastri.
"Beyond the Twilight Zone: Automated prediction of structural properties of proteins by recursive neural networks and remote homology information"
Proteins, 77(1), 181-90, 2009.
Abstract and PDF (Proteins web site)
G. Pollastri*, A. J. M. Martin, C. Mooney, A. Vullo.
"Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information"
BMC Bioinformatics, 8:201, 2007.
Open access abstract and PDF (BMC Bioinformatics web site).
BrownAle (contact density), XXStout (contact maps)
I. Walsh, D. Baú, A. J. M. Martin, C. Mooney, A. Vullo, G. Pollastri.
"Ab initio and template-based prediction of multi-class distance maps by two-dimensional recursive neural networks"
BMC Structural Biology, 9:5, 2009.
Open access abstract and PDF (BMC Structural Biology web site).
A. Vullo, I. Walsh, G. Pollastri.
"A two-stage approach for improved prediction of residue contact maps"
BMC Bioinformatics, 7:180, 2006.
Open access abstract and PDF
Porter+ (structural motifs)
C. Mooney, A. Vullo, G. Pollastri.
"Protein Structural Motif Prediction in Multidimensional φ-ψ Space leads to improved Secondary Structure Prediction"
Journal of Computational Biology, 13:8, 1486-1502, 2006.
Abstract and PDF (JCB web site).
XStout (coarse topologies)
G. Pollastri, A. Vullo, P. Frasconi, P. Baldi.
"Modular DAG-RNN Architectures for Assembling Coarse Protein Structures"
Journal of Computational Biology, 13:3, 631-650, 2006.
Abstract and PDF (JCB web site).
A more comprehensive list of publications is available here.