Mustguseal Protocol





Mustguseal aims at constructing large alignments of protein families from all available information about their structures and sequences in public databases. To meet this objective the Mustguseal protocol takes into account complex structural and functional variability of proteins within a large superfamily by implementing a combination of structure and sequence alignment procedures. All bioinformatic routines are fully automated and executed entirely on the server side. This page gives an overview of the Mustguseal protocol. The detailed description of the protocol is provided in the Mustguseal publication.





The Mustguseal protocol contains four major steps:


Step 1: Structure similarity search
The default input to the server (i.e., in Mode 1) is a protein structure submitted as a PDB ID and chain ID. Protein structure is considered to be more conserved during evolution than sequence. Therefore, a structure similarity search against the entire PDB database is used to collect remote evolutionary relatives of the query protein. These proteins share a common structural organization but have lost sequence similarity during natural selection and evolution from a common ancestor, and thus are likely to have broad functional variability. The selected set of proteins is expected to represent different protein families within a superfamily, and will be further referred to as the representative set.


Step 2: Construction of the core structural alignment
A multiple structural alignment of the representative set of proteins is created. This structural alignment of homologous representative proteins is the core of the Multiple Structure-Guided Sequence Alignment, and will be further referred to as the core structural alignment. It is important that the proteins in the core structural alignment represent the desired diversity among the protein families of interest. Users are advised to download this superimposition using the link on the Results page and evaluate: (1) whether the automatically selected proteins represent the desired diversity among the protein families of interest, and (2) whether the automatically created structural alignment is accurate (special attention should be paid to flexible loop regions and crucial non-standard/modified amino acids). A user-defined or edited core structural alignment can be submitted in Mode 2 or Mode 3. Please note that the server always works with the sequence representation of the core structural alignment (i.e., not the 3D coordinates but the FASTA sequence file), and therefore the sequence representation of a structural superimposition should be submitted in Mode 2 or Mode 3.
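Because the core structural alignment is handled as an aligned FASTA file, it can help to check an edited file before submitting it in Mode 2 or Mode 3. The following is a minimal sketch of such a check, not part of the server: the function name `read_fasta_alignment` and the sequence names in the usage example are hypothetical, and the only assumption is that gaps are written as `-` and all rows must have the same length.

```python
def read_fasta_alignment(text):
    """Parse aligned FASTA text into {name: gapped_sequence}.

    Raises ValueError if the rows do not all have the same length,
    which would mean the file is not a valid alignment.
    """
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in parts:
                raise ValueError(f"duplicate sequence name: {name}")
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequences may be wrapped over several lines
            parts[name].append(line)
    rows = {n: "".join(chunks) for n, chunks in parts.items()}
    if len({len(row) for row in rows.values()}) > 1:
        raise ValueError("aligned rows have different lengths")
    return rows


if __name__ == "__main__":
    example = ">protA\nMK-LV\n>protB\nMKT\nL-\n"
    print(read_fasta_alignment(example))
```

A file that passes this check is only guaranteed to be rectangular; whether the superimposition itself is accurate still has to be judged by inspecting the structures.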


Step 3: Sequence similarity search
Each protein from the core structural alignment (a representative protein) is independently used as a query in a sequence similarity search to collect its close evolutionary relatives (members of the corresponding family). Filters are applied to eliminate redundant entries as well as proteins that are too distant (i.e., outliers) within each group. The sequences collected for each representative protein form a block, and the sequences within each block are aligned using an appropriate multiple sequence alignment algorithm.
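The filtering idea can be illustrated with a small sketch. It is not the server's implementation: the function names, the use of pairwise identity as the similarity measure, and the thresholds `min_identity` and `max_identity` are all assumptions chosen for illustration, and the input is assumed to be rows already aligned to the representative (equal length, gaps as `-`).

```python
def identity(a, b):
    """Fraction of identical residues over columns where both rows have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def filter_family(rep_name, aligned, min_identity=0.3, max_identity=0.95):
    """Return the names kept for one family, starting with the representative.

    Hypothetical thresholds: hits below min_identity to the representative
    are treated as outliers; hits above max_identity to any already kept
    sequence are treated as redundant.
    """
    kept = [rep_name]
    for name, row in aligned.items():
        if name == rep_name:
            continue
        if identity(aligned[rep_name], row) < min_identity:
            continue  # too distant from the representative (outlier)
        if any(identity(aligned[k], row) > max_identity for k in kept):
            continue  # nearly identical to a kept sequence (redundant)
        kept.append(name)
    return kept


if __name__ == "__main__":
    hits = {"rep": "MKLVAG", "dup": "MKLVAG", "close": "MKLIAG", "far": "QWERTY"}
    print(filter_family("rep", hits))
```

In this toy input, `dup` is dropped as redundant and `far` as an outlier, leaving the representative and `close`.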


Step 4: Structure-guided sequence alignment
Remote evolutionary relatives that have lost sequence similarity during natural selection and specialization from a common ancestor should be compared by structural superimposition, while sequence-based alignments are meaningful only for close homologs. Each sequence alignment constructed at Step 3 contains only close homologs, which were collected by sequence similarity search, and each contains one representative protein. The representative proteins were collected by structure similarity search and cannot be compared by their sequences due to significant differences. However, these remote homologs can be compared by structural comparison, which was performed at Step 2 when the core structural alignment was built. The final Step 4 of the protocol is to merge the sequence alignments created at Step 3 using the core structural alignment from Step 2 as a guide. During this merge, gaps are introduced into the sequences of the representative proteins within the corresponding sequence alignments until their superposition across all sequence alignments becomes identical to that in the core structural alignment. Introducing a gap into a representative protein sequence inserts a gap at the same position in every sequence of its alignment, shifting the downstream columns of that alignment together. In other words, the superposition of representative proteins in the core structural alignment remains unchanged, and the superposition of close homologs within each sequence alignment block remains unchanged, but a new superposition is created between proteins that are present in the sequence alignments but not in the core structural alignment. These proteins in different sequence alignment blocks, which are remote relatives of each other but have no structural information available, are aligned by using the structural alignment of representative proteins as a guide, i.e., structure-guided sequence alignment.
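The merge can be sketched in a few lines of code. This is an illustrative reading of the description above, not the server's actual implementation: the function name `merge_family_alignments`, the dictionary layout of the inputs, and the sequence names are hypothetical. The sketch also assumes that sequence names are unique across families and that columns in which a representative has a gap in its own family alignment (insertions in its homologs) are kept as family-specific columns.

```python
GAP = "-"


def merge_family_alignments(core, families):
    """Merge family alignments using the core structural alignment as a guide.

    core:     {rep_name: gapped row of the core structural alignment}
    families: {rep_name: {seq_name: gapped row}}, each including its rep_name
    """
    reps = list(core)
    core_len = len(next(iter(core.values())))

    # For each family, locate the column of every representative residue and
    # the insert columns (representative has a gap) preceding that residue.
    layout = {}
    for rep in reps:
        rep_row = families[rep][rep]
        assert rep_row.replace(GAP, "") == core[rep].replace(GAP, "")
        residue_cols, inserts = [], [[]]
        for col, ch in enumerate(rep_row):
            if ch == GAP:
                inserts[-1].append(col)
            else:
                residue_cols.append(col)
                inserts.append([])
        layout[rep] = (residue_cols, inserts)

    merged = {name: [] for rep in reps for name in families[rep]}

    def emit(cols):
        """Append one output column; families absent from cols receive gaps."""
        for rep in reps:
            col = cols.get(rep)
            for name, row in families[rep].items():
                merged[name].append(GAP if col is None else row[col])

    seen = {rep: 0 for rep in reps}  # representative residues placed so far
    for c in range(core_len):
        present = [rep for rep in reps if core[rep][c] != GAP]
        for rep in present:  # family-specific insertions come first
            for col in layout[rep][1][seen[rep]]:
                emit({rep: col})
        # The structurally equivalent column, shared by all present families
        emit({rep: layout[rep][0][seen[rep]] for rep in present})
        for rep in present:
            seen[rep] += 1
    for rep in reps:  # insertions after the last representative residue
        for col in layout[rep][1][seen[rep]]:
            emit({rep: col})
    return {name: "".join(chars) for name, chars in merged.items()}


if __name__ == "__main__":
    core = {"A": "MK-LV", "B": "MKTL-"}
    families = {
        "A": {"A": "MKL-V", "A1": "MRLGV"},
        "B": {"B": "MKTL", "B1": "M-TL"},
    }
    for name, row in merge_family_alignments(core, families).items():
        print(name, row)
```

In the toy example, the homologs `A1` and `B1` end up aligned to each other even though they were never compared directly: their columns are positioned solely by the structural correspondence between the representatives `A` and `B`.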