Mustguseal Parameters





The user defines the key parameters of the two crucial steps of the protocol: the Structure similarity search and the Sequence similarity search. This page gives an overview of these parameters and explains how they influence the results.





Structure similarity search


Protein structure is considered to be more conserved in the course of evolution than sequence. Therefore, Mustguseal runs a structure similarity search against the entire PDB database to collect remote evolutionary relatives of the query protein, selecting hits with sufficient similarity in structural organization. In Mustguseal, structural similarity is quantified by the alignment coverage, i.e., the ratio of the number of amino acids in the aligned (superimposed) parts of the query and target structures to the full size of the corresponding structure. Only amino acids that belong to secondary structure elements are counted when calculating the ratio. By default the thresholds are set to 90% and 90%, i.e., the superimposed region has to cover at least 90% of the query and at least 90% of the target for the target to be selected as a hit. The choice of these thresholds should be made individually for each task and should take into account the structural characteristics of the particular family. E.g., the example above shows a single-domain query protein (magenta) being matched with a single-chain dual-domain target protein (cyan). Approximately 90% of the query covers ~45% of the target (i.e., the full query structure is involved in superimposition with the target, but at most half of the target structure is involved in superimposition with the query). To select this target as a match with the query, the thresholds should be set to at most 90% (of the query structure) and 45% (of the target structure).
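The coverage criterion described above can be illustrated with a minimal Python sketch. The function names and the residue counts in the usage example are hypothetical and chosen only to reproduce the 90%/45% case; the sketch also assumes that a single count of superimposed secondary-structure residues applies to both the query and the target. It is not the Mustguseal implementation.

```python
def coverage(aligned_residues, total_residues):
    """Fraction of a structure's secondary-structure residues that are superimposed."""
    if total_residues <= 0:
        raise ValueError("total_residues must be positive")
    return aligned_residues / total_residues


def is_structural_hit(aligned, query_size, target_size,
                      min_query=0.90, min_target=0.90):
    """Return True if the match satisfies both coverage thresholds.

    aligned     -- residues in the superimposed region (assumed equal for both sides)
    query_size  -- secondary-structure residues in the query
    target_size -- secondary-structure residues in the target
    """
    return (coverage(aligned, query_size) >= min_query
            and coverage(aligned, target_size) >= min_target)


# Hypothetical numbers: a 200-residue single-domain query,
# a 400-residue dual-domain target, 180 superimposed residues.
default_result = is_structural_hit(180, 200, 400)
relaxed_result = is_structural_hit(180, 200, 400, min_target=0.45)
```

With the default 90%/90% thresholds the dual-domain target is rejected, because only 45% of it is covered; lowering the target threshold to 45% accepts it.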

Generally speaking, if the user wishes to construct a more diverse alignment, or if the structure similarity search with the default setup (90% and 90%) found no structures or too few structures similar to the query, the user should consider decreasing the Lowest acceptable match in the query and target structures in the Structure similarity search input box to 70% and 70%, respectively. Evolutionarily more remote proteins will then be collected, increasing the structural and functional variability of the alignment. The 70%-70% setting is usually optimal to obtain a diverse set of functions within a common structural framework, provided that the majority of proteins in the superfamily have equivalent structural dimensions. If the dimensions of protein structures are not equivalent among the protein families of interest (e.g., see the magenta-cyan case above), one of the parameters should be set to 70% and the other one should be set below 70%, but not less than 30%.

The aim of the structure similarity search is to collect remote evolutionary relatives within the superfamily to serve as the core for the structure-guided sequence alignment of this superfamily. Homologous proteins within a superfamily may have significantly diverged in both sequence and structure during evolution. The user should note that sufficient structural similarity between two proteins is an important criterion for defining homology in bioinformatics. Therefore, decreasing the two thresholds to very low values (e.g., 30% and 30%) may help in identifying all available members of the superfamily of interest, but significantly increases the probability of collecting unrelated (i.e., not homologous) proteins and corrupting the alignment. E.g., the example above shows a 30%-30% match between the query protein (magenta) and the target protein (yellow). It is clear from this comparison that the two proteins are unrelated and their alignment would be meaningless in the context of the Mustguseal protocol.



Sequence similarity search

Each representative protein from the core structural alignment is used as a query to run independent sequence similarity searches. The following parameters influence the outcome of this step.

Selection of a database. The UniProtKB/Swiss-Prot database is used by default. This database provides protein sequences together with a functional annotation that is, in general, trustworthy. The downside is its relatively small size: not all protein families are well represented in the Swiss-Prot database. The user could try the UniProtKB/Swiss-Prot+TrEMBL database, which is much larger and usually provides more proteins for the alignment. The downside is that the functional annotation provided by TrEMBL is a prediction (i.e., annotation transfer by similarity with well-studied proteins) and should be considered for information purposes only.

Redundancy filter threshold (%). Sequences collected by the sequence similarity search in the selected database are further filtered for non-redundancy. By default, only one sequence is preserved from each cluster of sequences which share at least 95% similarity, and all others are dismissed from further consideration. Aligning sequences with more than 95% identity would increase the computational cost while adding information of doubtful value. Therefore, relaxing this threshold (i.e., setting it to values >95%) could be justified only for a particular purpose. The user could consider tightening this threshold (i.e., setting it to values <95%) to reduce the total number of proteins when constructing alignments of very large superfamilies (e.g., the alpha-beta hydrolases). Generally speaking, you should not aim at constructing a very large alignment (>5000-10000 proteins). Large alignments are impractical, as they would almost certainly contain redundant information and would be computationally hard to analyze.
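To illustrate the idea of keeping one sequence per cluster, the following Python sketch applies a simple greedy filter. It is an assumption-laden illustration, not the method used by Mustguseal: the clustering procedure of the server is not described here, the helper names are hypothetical, and the sketch assumes the sequences are already aligned to equal length so that percent identity can be counted column by column.

```python
from __future__ import annotations


def percent_identity(a: str, b: str) -> float:
    """Percent identity of two equal-length, pre-aligned sequences.

    Columns where both sequences have a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * matches / len(pairs)


def filter_redundant(seqs: dict[str, str], threshold: float = 95.0) -> list[str]:
    """Greedy filter: keep a sequence only if it is below the threshold
    with respect to every representative kept so far."""
    representatives: list[str] = []
    for name, seq in seqs.items():
        if all(percent_identity(seq, seqs[r]) < threshold for r in representatives):
            representatives.append(name)
    return representatives


# Hypothetical toy sequences of 20 residues each.
toy = {
    "a": "ACDEFGHIKLMNPQRSTVWY",
    "b": "ACDEFGHIKLMNPQRSTVWA",  # 19/20 identical to "a" (95%)
    "c": "YWVTSGHIKLMNPQRSTVWY",  # 15/20 identical to "a" (75%)
}
kept_default = filter_redundant(toy)        # threshold 95%
kept_tight = filter_redundant(toy, 70.0)    # tighter threshold
```

In this toy case the default threshold keeps "a" and "c", while tightening the threshold to 70% leaves only "a", which shows how a lower value reduces the size of the final set.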

Dissimilarity filter threshold (bit score per column). The selected set of proteins is further filtered to eliminate proteins that are too distant and would likely cause errors during the sequence alignment. The dissimilarity threshold is quantified in bit scores per column and describes the entropy per column in the pairwise alignment of a selected protein with the representative protein (which was used as a query to run this similarity search). By default, proteins with at least 0.5 bit score per column with the representative protein are preserved, and all others are dismissed from further consideration. The bit score per column may take a wide range of values. However, over the past 10 years we have never used anything but values within the [0; 1] range (usually 0.25 or 0.5). Therefore, so as not to confuse the user with a large selection of rarely used settings, we limit the Dissimilarity filter threshold to [0; 1]. Should you need to apply a different filtering procedure to the results of the sequence similarity search, you should submit your own prepared input to Mustguseal in Mode 3.
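As a rough illustration of this filter, the sketch below divides the bit score of a pairwise alignment by the number of alignment columns and compares the result with the threshold. The function names and the example scores are hypothetical, and the exact way Mustguseal derives the per-column value is an assumption here.

```python
def bits_per_column(bit_score, alignment_length):
    """Bit score of a pairwise alignment normalized by its number of columns."""
    if alignment_length <= 0:
        raise ValueError("alignment_length must be positive")
    return bit_score / alignment_length


def passes_dissimilarity_filter(bit_score, alignment_length, threshold=0.5):
    """Keep a protein if its alignment with the representative reaches the threshold.

    The threshold is restricted to [0, 1], mirroring the range offered by the server.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within [0, 1]")
    return bits_per_column(bit_score, alignment_length) >= threshold


# Hypothetical hits against a representative protein, 300 alignment columns each.
close_hit = passes_dissimilarity_filter(180, 300)              # 0.6 bits per column
distant_hit = passes_dissimilarity_filter(90, 300)             # 0.3 bits per column
distant_hit_relaxed = passes_dissimilarity_filter(90, 300, threshold=0.25)
```

The distant hit is rejected at the default 0.5 but accepted at 0.25, the other commonly used value.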

Sequence length filter threshold (%). By default, all proteins which deviate by more than 20% in length from the reference protein are rejected in the corresponding sequence similarity search. This simple filter is very helpful for eliminating protein fragments and implausibly long sequences annotated as 'Unknown', which are frequent in the TrEMBL database. Introducing these 'proteins' can disrupt a sequence alignment (by causing large gaps in the sequences of 'normal' proteins) while adding information of doubtful value. However, in some cases it is crucial to relax the Sequence length filter threshold. If your query PDB corresponds to a fragment of a protein chain, or to a protein which is represented by a larger precursor sequence in the sequence database, you should set this parameter to a higher value.
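The length filter can be sketched in a few lines of Python. The function name and the example lengths are hypothetical, and the sketch assumes that the deviation is measured relative to the length of the reference protein.

```python
def passes_length_filter(seq_len, ref_len, max_deviation_pct=20):
    """Return True if seq_len deviates from ref_len by at most max_deviation_pct percent.

    The comparison is done on integers-times-percent to avoid rounding
    problems exactly at the threshold.
    """
    if ref_len <= 0:
        raise ValueError("ref_len must be positive")
    return abs(seq_len - ref_len) * 100 <= max_deviation_pct * ref_len


# Hypothetical reference protein of 300 residues.
similar_size = passes_length_filter(250, 300)     # about 17% shorter
fragment = passes_length_filter(120, 300)         # 60% shorter
precursor = passes_length_filter(450, 300)        # 50% longer
precursor_relaxed = passes_length_filter(450, 300, max_deviation_pct=60)
```

With the default 20% the fragment and the longer precursor are both rejected; raising the threshold, as suggested for queries that correspond to a fragment or to a larger precursor sequence, lets the precursor through.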