Mustguseal Parameters





The user can adjust the key parameters of the two crucial steps of the protocol: the Structure similarity search and the Sequence similarity search. This page gives an overview of these parameters and explains how they influence the results.





Structure similarity search


Protein structure is considered to be more conserved in the course of evolution than sequence. Therefore, Mustguseal runs a structure similarity search against the entire PDB database to collect remote evolutionary relatives of the query protein, selecting hits with sufficient similarity in structural organization. In Mustguseal the structural similarity is quantified by the alignment coverage, i.e., the ratio of the number of amino acids in the aligned (superimposed) parts of the query and the target structures to the full size of the corresponding structures. Only amino acids involved in secondary structure elements are considered when calculating the ratio. By default the thresholds are set to 90% and 90%, i.e., at least 90% of the query has to be superimposed on at least 90% of the target for the target to be selected as a hit.

The choice of these thresholds should be individual for each task and should take into account the structural characteristics of the particular family. For instance, the example above shows a single-chain single-domain query protein (magenta) being matched with a single-chain dual-domain target protein (cyan). Approximately 90% of the query matches approximately 50% of the target (i.e., the full query structure is involved in the superimposition with the target, but at most half of the target structure is involved in the superimposition with the query). To select this target as a match for the query, the thresholds should be set to approximately 90% (of the query structure) and 45% (of the target structure).
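The two-threshold rule can be illustrated with a minimal sketch. The names StructuralHit, coverage_percent and passes_thresholds are hypothetical and are not part of Mustguseal; the coverage values are the approximate figures from the magenta-cyan example.

```python
from dataclasses import dataclass


@dataclass
class StructuralHit:
    """Coverage of one query-target superimposition, in percent.

    Hypothetical container: only residues in secondary structure
    elements are assumed to have been counted upstream.
    """
    query_coverage: float   # % of the query involved in the superimposition
    target_coverage: float  # % of the target involved in the superimposition


def coverage_percent(aligned_residues: int, total_residues: int) -> float:
    """Share of a structure covered by the superimposition, in percent."""
    if total_residues <= 0:
        raise ValueError("total_residues must be positive")
    return 100.0 * aligned_residues / total_residues


def passes_thresholds(hit: StructuralHit,
                      min_query: float = 90.0,
                      min_target: float = 90.0) -> bool:
    """A hit is accepted only if BOTH coverages reach their thresholds."""
    return (hit.query_coverage >= min_query
            and hit.target_coverage >= min_target)


# Single-domain query matched with a dual-domain target:
# about 90% of the query covers about 50% of the target.
hit = StructuralHit(query_coverage=90.0, target_coverage=50.0)

print(passes_thresholds(hit))              # default 90%/90% setup
print(passes_thresholds(hit, 90.0, 45.0))  # relaxed target threshold
```

With the default setup the dual-domain target is rejected because of its low target coverage; lowering only the target threshold to 45% accepts it while keeping the requirement on the query strict.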

Generally speaking, if the aim is to construct a more diverse alignment, or if the structure similarity search with the default setup found no structures or too few structures similar to the query, the user should consider decreasing the Lowest acceptable match in the query and target structures in the Structure similarity search input box to 70% and 70%, respectively. Evolutionarily more remote proteins will then be collected, increasing the structural and functional variability of the alignment. The 70%-70% setting can be used to obtain a diverse set of functions within a common structural framework, provided that the proteins in the superfamily have equivalent structural dimensions. If the dimensions of the protein structures (i.e., the domain organisation) are not equivalent among the protein families of interest (e.g., see the magenta-cyan case above), one of the parameters should be set to 70% and the other one should be set below 70%, but not lower than 30%.

The aim of the structure similarity search is to collect remote evolutionary relatives within the superfamily to serve as the core for the structure-guided sequence alignment of this superfamily. Homologous proteins within a superfamily may have diverged significantly in both sequence and structure during evolution. Therefore, decreasing the two thresholds to very low values (e.g., 30% and 30%) may help to identify all available members of the superfamily of interest, but it significantly increases the probability of collecting unrelated (i.e., non-homologous) proteins and corrupting the alignment. For instance, the example above shows a 30%-30% match between the query protein (magenta) and the target protein (yellow). It is clear from this comparison that the two proteins are unrelated and that their alignment would be meaningless in the context of the Mustguseal protocol.



Sequence similarity search

Each representative protein from the core structural alignment is used as a query to run a sequence similarity search. The following parameters influence the outcome of this step.

Selection of a database. The UniProtKB/Swiss-Prot database is used by default. This database provides protein sequences together with a generally trustworthy functional annotation. The downside is its relatively small size: not all protein families are well represented in Swiss-Prot. The user could try the UniProtKB/Swiss-Prot+TrEMBL database, which is much larger and usually provides more proteins for the alignment. The downside is that the functional annotation provided by TrEMBL is a prediction (i.e., annotation transferred by similarity with well-studied proteins) and should be considered for information purposes only.

Maximum number of sequences to collect in each subsearch. If the UniProtKB/Swiss-Prot database is selected, three subsearches are carried out, using the Blosum45, Blosum62, and Blosum80 matrices. If the UniProtKB/Swiss-Prot+TrEMBL database is selected, four subsearches are carried out: three in the Swiss-Prot database using the Blosum45, Blosum62, and Blosum80 matrices, and one in the TrEMBL database using the Blosum62 matrix. The Maximum number of sequences to collect in each subsearch parameter limits the number of sequences collected by each subsearch. Keep in mind that this parameter regulates only the size of the 'raw' set: all sequences collected by the similarity search are further filtered for sequence length, redundancy, and dissimilarity with the representative protein (see the description of the three filters below), so the final set of sequences can be significantly smaller. The default value of 500 works well in the vast majority of cases. If your final alignment turned out to be too large, you may want to decrease this parameter (e.g., to 50) to limit the number of proteins collected during the similarity search. If the sequence similarity search found no sequences or too few sequences similar to the representative protein, you should check the output logs for the corresponding search (i.e., the seqsearch_PDBID.stdout.log file in the BLAST_PDBID folder, see the Explanation of the Output for more details). If the number of proteins collected in each subsearch reaches the limit (e.g., 500 proteins) but they are then dismissed as redundant (too similar to each other) during the subsequent filtering, you could increase this parameter (e.g., to 1000) in an attempt to collect more diverse proteins.
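The relation between the selected database, the number of subsearches, and the size of the 'raw' set can be sketched as follows. The SUBSEARCHES table and the raw_set_upper_bound function are hypothetical names used for illustration only; the figure returned is an upper bound per representative protein before any filtering, and the same sequence may well be found by several subsearches.

```python
# Subsearches run for each database option, as (database, matrix) pairs.
SWISS_PROT_SUBSEARCHES = [
    ("Swiss-Prot", "Blosum45"),
    ("Swiss-Prot", "Blosum62"),
    ("Swiss-Prot", "Blosum80"),
]

SUBSEARCHES = {
    "UniProtKB/Swiss-Prot": SWISS_PROT_SUBSEARCHES,
    "UniProtKB/Swiss-Prot+TrEMBL": SWISS_PROT_SUBSEARCHES
    + [("TrEMBL", "Blosum62")],
}


def raw_set_upper_bound(database: str, max_per_subsearch: int = 500) -> int:
    """Largest possible 'raw' set for one representative protein.

    This is only an upper bound: the length, redundancy and
    dissimilarity filters are applied afterwards.
    """
    if max_per_subsearch <= 0:
        raise ValueError("max_per_subsearch must be positive")
    return len(SUBSEARCHES[database]) * max_per_subsearch


for database in SUBSEARCHES:
    print(database, raw_set_upper_bound(database))
```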

Redundancy filter threshold (%). Sequences collected by the sequence similarity search in the selected database are further filtered for redundancy. By default, only one sequence is preserved from each cluster of sequences which share at least 95% similarity, and all others are dismissed from further consideration. Aligning sequences with more than 95% identity would increase the computational cost while adding little information. Therefore, relaxing this threshold (i.e., setting it to values >95%) can be justified only for a particular purpose. The user could consider tightening this threshold (i.e., setting it to values <95%) to reduce the total number of proteins when constructing alignments of very large superfamilies (e.g., the alpha-beta hydrolases). Generally speaking, you should not aim at constructing a very large alignment (more than 5000-10000 proteins). Very large alignments are impractical, as they would almost certainly contain redundant information and would be computationally hard to analyze.
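The idea of keeping one sequence per cluster can be illustrated with a greedy sketch. This is not the clustering procedure used by Mustguseal, which is not described here: the percent_identity function is a toy measure that assumes pre-aligned sequences of equal length, and the sequences themselves are made up.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Toy identity for two pre-aligned sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def filter_redundant(seqs: Dict[str, str],
                     threshold: float = 95.0) -> List[str]:
    """Greedy filter: keep a sequence only if it is below the threshold
    identity to every sequence that has already been kept."""
    kept: List[str] = []
    for name, seq in seqs.items():
        if all(percent_identity(seq, seqs[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up 20-residue sequences: 'b' differs from 'a' at 1 position
# (95% identity), 'c' differs from 'a' at 5 positions (75% identity).
seqs = {
    "a": "ACDEFGHIKLMNPQRSTVWY",
    "b": "ACDEFGHIKLMNPQRSTVWA",
    "c": "ACDEFGHIKLMNPQRAAAAA",
}

print(filter_redundant(seqs))        # default 95% threshold
print(filter_redundant(seqs, 70.0))  # tightened threshold
```

Tightening the threshold to 70% also removes the more distant sequence, which is how a lower setting reduces the size of alignments of very large superfamilies.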

Dissimilarity filter threshold (bit score per column). The selected set of proteins is further filtered to eliminate proteins that are too distant and would likely cause errors during the sequence alignment. The dissimilarity threshold is quantified in bit scores per column and describes the entropy per column in the pairwise alignment of a selected protein with the representative protein (which was used as the query to run this similarity search). By default, proteins with at least 0.5 bit score per column with the representative protein are preserved, and all others are dismissed from further consideration. Set this parameter to a higher value to preserve only proteins which are more similar to the query, or to a lower value to allow less similar proteins in the sequence alignment. The bit score per column may take a wide range of values; however, the most commonly used values lie within the [0; 1] range (usually 0.25 or 0.5). Therefore, to avoid confusing the user with a large selection of rarely used settings, we limit the Dissimilarity filter threshold to [0; 1].
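A minimal sketch of the dissimilarity check follows. It assumes that the bit score per column is the bit score of the pairwise alignment divided by the number of alignment columns; how Mustguseal counts the columns is not specified here, and the function names and the scores in the example are hypothetical.

```python
def bits_per_column(bit_score: float, alignment_columns: int) -> float:
    """Bit score of a pairwise alignment divided by its column count."""
    if alignment_columns <= 0:
        raise ValueError("alignment_columns must be positive")
    return bit_score / alignment_columns


def passes_dissimilarity_filter(bit_score: float,
                                alignment_columns: int,
                                threshold: float = 0.5) -> bool:
    """Keep a protein if it reaches the threshold bit score per column."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold is limited to the [0; 1] range")
    return bits_per_column(bit_score, alignment_columns) >= threshold


# Hypothetical hits aligned over 300 columns.
print(passes_dissimilarity_filter(180.0, 300))       # 0.6 bits per column
print(passes_dissimilarity_filter(90.0, 300))        # 0.3 bits per column
print(passes_dissimilarity_filter(90.0, 300, 0.25))  # lower threshold
```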

Sequence length filter threshold (%). By default, all proteins which deviate by more than 20% in length from the reference protein sequence are rejected in the corresponding sequence similarity search. This simple filter is very helpful for eliminating protein fragments and suspiciously large sequences annotated as 'Unknown', which are frequent in the sequence databases, especially in TrEMBL. Including these 'proteins' could disrupt a sequence alignment (by causing large gaps in the sequences of 'normal' proteins) while adding little information. However, in some cases it is crucial to relax the Sequence length filter threshold. If your query PDB corresponds to a fragment of a protein chain, or to a protein which is represented by a larger precursor sequence in the sequence database, you should set this parameter to a higher value.
If you get a warning similar to this one

Warning: Sequence similarity search for the representative protein 0_1gm9_B has returned only itself
you should download the archive with the sequence similarity search results, open the corresponding folder (i.e., BLAST_0_1gm9_B in this example), and check the log (seqsearch_0_1gm9_B.stdout.log) for output like this:
Info: Sequence P06875 has length 846 and will be dismissed (151.9 % of the reference sequence length)
Info: Sequence P07941 has length 844 and will be dismissed (151.5 % of the reference sequence length)
Info: Sequence P15558 has length 774 and will be dismissed (139.0 % of the reference sequence length)

This warning appears because P06875, P07941, and P15558 are sequences of the precursor protein, which includes chain A, the linker, and chain B, while the PDB entry 1GM9:B corresponds to chain B only. To include the sequences of proteins P06875, P07941, and P15558 in the alignment, set the Sequence length filter threshold to 60% (i.e., allow lengths in the range 40%-160% of the length of the reference protein sequence), or effectively switch the filter off by setting it to 1000%.
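The effect of the threshold on the three sequences from the log can be checked with a short sketch. The function names are hypothetical. The reference length of 557 residues is not stated in the log; it is an assumption inferred from the reported percentages. Treating the range boundaries as inclusive is also an assumption.

```python
def percent_of_reference(seq_length: int, reference_length: int) -> float:
    """Length of a sequence as a percentage of the reference length."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return 100.0 * seq_length / reference_length


def passes_length_filter(seq_length: int,
                         reference_length: int,
                         threshold: float = 20.0) -> bool:
    """Keep a sequence whose length deviates from the reference by no
    more than `threshold` percent (boundaries assumed inclusive)."""
    deviation = abs(percent_of_reference(seq_length, reference_length) - 100.0)
    return deviation <= threshold


# Assumption: reference length inferred from the log percentages.
REFERENCE_LENGTH = 557

# Sequence lengths as reported in the log above.
HITS = {"P06875": 846, "P07941": 844, "P15558": 774}

for accession, length in HITS.items():
    pct = round(percent_of_reference(length, REFERENCE_LENGTH), 1)
    print(accession, pct,
          passes_length_filter(length, REFERENCE_LENGTH, 20.0),
          passes_length_filter(length, REFERENCE_LENGTH, 60.0))
```

All three precursor sequences fall outside the default 80%-120% range and inside the 40%-160% range allowed by a 60% threshold.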