Mustguseal Protocol

Mustguseal constructs large alignments of functionally diverse protein families by automatically collecting the structure and sequence information available for them in public databases. To this end, the Mustguseal protocol combines structure and sequence comparison algorithms to account for the sequence and structural variability of functionally diverse homologs within a large superfamily. All bioinformatic routines are fully automated and executed entirely on the server side. This page gives an overview of the Mustguseal protocol. The protocol is also discussed in the Mustguseal publication.


The Mustguseal protocol contains four major steps:

Step 1: Structure similarity search
The default input to the server (i.e., in Mode 1) is a protein structure submitted as a PDB ID and a chain ID. Protein structures are considered to be more conserved during evolution than sequences. Therefore, a structure similarity search against the entire PDB database is used to collect remote evolutionary relatives of the query protein. These distant homologs share a common structural organization but have lost sequence similarity during natural selection and evolution from a common ancestor, and thus are likely to have broad functional variability. The collected set of proteins is expected to represent different protein families with various functions within a superfamily, and will be further referred to as the representative set.

Step 2: Construction of the core structural alignment
The structural alignment of a representative set of homologous proteins is the core of the Multiple Structure-Guided Sequence Alignment: it defines the scope and diversity of the final alignment, and will be further referred to as the core structural alignment. It is important that proteins in this core structural alignment represent the desired diversity among the protein families of interest. Users are advised to download this superimposition using a link at the Results page and evaluate: (1) if the automatically selected proteins represent the desired diversity among the protein families of interest, and (2) if the automatically created structural alignment is accurate (special attention should be paid to flexible loop regions and crucial non-standard/modified amino acids). A user-defined/edited core structural alignment can be submitted as a new task in Mode 2 or Mode 3. Please note that the server always deals with the sequence representation of the core structural alignment (i.e., not the 3D coordinates but the FASTA sequence file), and therefore the sequence representation of a structural superimposition should be submitted in Mode 2 or Mode 3.
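Before resubmitting an edited core structural alignment, it can help to check that its sequence representation is still a consistent alignment. The following is a minimal sketch, not the server's own validation: the function name read_fasta_alignment is hypothetical, and the only checks assumed here are that every record has a unique name and that all gapped rows have the same length.

```python
def read_fasta_alignment(text):
    """Parse FASTA text into {name: gapped_row} and check that all rows have equal length.

    Hypothetical helper for illustration; the server may apply other checks.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            name = header[0]
            if name in records:
                raise ValueError(f"duplicate record name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[name].append(line)

    # Join wrapped sequence lines into a single gapped row per record.
    rows = {n: "".join(parts) for n, parts in records.items()}
    if len({len(row) for row in rows.values()}) > 1:
        raise ValueError("rows differ in length, so this is not an alignment")
    return rows


example = ">A\nMK-\nLV\n>B\nMKQL-\n"
alignment = read_fasta_alignment(example)
```

In this toy input the row for A is wrapped over two lines, and both rows come out with five columns.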

Step 3: Sequence similarity search
Each protein from the core structural alignment, i.e., each representative protein, is independently used as a query in a sequence similarity search to collect its close evolutionary relatives, which are members of the family it represents. If the Swiss-Prot database is selected, three independent subsearches using the Blosum45, Blosum62 and Blosum80 matrices are carried out for each query. Alternatively, if the Swiss-Prot+TrEMBL databases are selected, an additional subsearch in the TrEMBL database using the Blosum62 matrix is performed. Filters are applied to eliminate redundant entries as well as too distant proteins (i.e., outliers) within each group. The sequences within each group are then aligned using a multiple sequence alignment algorithm.
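The subsearch scheme and the filtering idea can be sketched as follows. This is an illustration under stated assumptions rather than the server's implementation: the function names, the query ID "1ABC_A", the identity measure, and the thresholds 0.3 and 0.95 are all hypothetical, since the actual filters and cutoffs are not specified on this page.

```python
SWISSPROT_MATRICES = ("BLOSUM45", "BLOSUM62", "BLOSUM80")


def plan_subsearches(representatives, database):
    """List (query, database, matrix) subsearches as described in Step 3."""
    if database not in ("Swiss-Prot", "Swiss-Prot+TrEMBL"):
        raise ValueError(f"unknown database choice: {database}")
    jobs = []
    for query in representatives:
        # Three Swiss-Prot subsearches per representative protein.
        jobs += [(query, "Swiss-Prot", matrix) for matrix in SWISSPROT_MATRICES]
        if database == "Swiss-Prot+TrEMBL":
            # One additional TrEMBL subsearch with BLOSUM62.
            jobs.append((query, "TrEMBL", "BLOSUM62"))
    return jobs


def identity(a, b):
    """Fraction of identical residues over columns where both aligned rows have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 0.0


def filter_hits(query_row, hits, min_identity=0.3, max_identity=0.95):
    """Greedy filter with hypothetical thresholds: drop outliers, then redundant hits."""
    kept = {}
    for name, row in hits.items():
        if identity(query_row, row) < min_identity:
            continue  # too distant from the query: treated as an outlier
        if any(identity(row, other) > max_identity for other in kept.values()):
            continue  # nearly identical to a hit that was already kept
        kept[name] = row
    return kept


jobs = plan_subsearches(["1ABC_A"], "Swiss-Prot+TrEMBL")
kept = filter_hits("MKLV", {"h1": "MKLV", "h2": "MKLV", "h3": "MRLV", "h4": "AAAA"})
```

Here the plan contains four subsearches for the single query, and the filter keeps h1 and h3: h2 duplicates h1 and h4 shares no residues with the query.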

Step 4: Structure-guided sequence alignment
Remote evolutionary relatives that have lost sequence similarity during natural selection and specialization from a common ancestor should be compared by structural superimposition, while sequence-based alignments are meaningful only for close homologs. The sequence alignments constructed during Step 3 contain only close homologs (within each alignment), which were collected by sequence similarity search. Each sequence alignment contains a representative protein. Representative proteins were collected by structure similarity search and cannot be compared by sequence because their amino acid sequences differ too much. However, these remote homologs can be compared by means of the structural comparison that was performed at Step 2, when the core structural alignment was built. The final Step 4 of this protocol is to merge the sequence alignments created at Step 3 using the core structural alignment from Step 2 as a guide. Columns of gaps are inserted into the individual sequence alignments so that their total lengths become equal and the superposition of the representative proteins in the merged sequence alignment matches their superimposition in the core structural alignment. In other words, the superposition of representative proteins in the core structural alignment remains unchanged, the superposition of close homologs within each sequence alignment block remains unchanged, but a new superposition is created for proteins which are present in the sequence alignments but not in the core structural alignment.
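The merging idea in Step 4 can be sketched in a few lines. This is a simplified illustration, not the server's code: the function name merge_alignments and the input layout are assumptions, each family alignment is assumed to contain its representative under the same name as in the core alignment, and family-specific insertion columns are placed immediately before the core column of the representative residue that follows them.

```python
GAP = "-"


def merge_alignments(core, families):
    """Merge family alignments using the core structural alignment as a guide.

    core:     {rep_name: gapped_row} for the representative proteins
    families: {rep_name: {seq_name: gapped_row}}, each including rep_name itself
    """
    reps = list(core)
    core_len = len(core[reps[0]])

    # For each family, record the column of every representative residue and
    # the insertion columns (representative has a gap) that precede it.
    layout = {}
    for rep in reps:
        rep_row = families[rep][rep]
        if rep_row.replace(GAP, "") != core[rep].replace(GAP, ""):
            raise ValueError(f"{rep}: sequence differs between core and family")
        insertions, residues, pending = [], [], []
        for j, ch in enumerate(rep_row):
            if ch == GAP:
                pending.append(j)
            else:
                insertions.append(pending)
                residues.append(j)
                pending = []
        insertions.append(pending)  # insertions after the last residue
        layout[rep] = (insertions, residues)

    # Each merged column maps a family to one of its own columns;
    # families missing from the mapping receive a gap in that column.
    columns = []
    consumed = dict.fromkeys(reps, 0)
    for c in range(core_len + 1):
        at_end = c == core_len
        for rep in reps:
            if at_end or core[rep][c] != GAP:
                for j in layout[rep][0][consumed[rep]]:
                    columns.append({rep: j})
        if at_end:
            break
        column = {}
        for rep in reps:
            if core[rep][c] != GAP:
                column[rep] = layout[rep][1][consumed[rep]]
                consumed[rep] += 1
        columns.append(column)

    merged = {}
    for rep in reps:
        for name, row in families[rep].items():
            merged[name] = "".join(
                row[col[rep]] if rep in col else GAP for col in columns
            )
    return merged


core = {"A": "MK-LV", "B": "MKQL-"}
families = {
    "A": {"A": "MKL-V", "A2": "MRLGV"},
    "B": {"B": "MKQL", "B2": "MRQL"},
}
merged = merge_alignments(core, families)
```

In this toy example the rows of A and B in the merged alignment reproduce the core alignment once the column that holds only the insertion from A2 is ignored, while A2 and B2 gain a new superposition relative to each other.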