Input to Mustguseal





Mustguseal can be used to build focused alignments of selected protein families or to collect and superimpose a large set of related proteins within a superfamily. The scope of the final alignment is defined by the diversity of representative proteins in the core structural alignment, which can be created on-site or submitted by the user. There are three ways of submitting a task to Mustguseal. Mode 1 is the default: it is fully automated and is the easiest way to obtain your alignment. Modes 2 and 3 provide an opportunity to refine the alignment by editing the components it is built from. Together, the three modes provide full control over the alignment construction process. Users are advised to start with Mode 1 and then switch to Mode 2, and then to Mode 3, if necessary. This page describes how you can use Mustguseal to build a large alignment of your protein families for a particular purpose.




Three input modes and a results claim form are available on the Mustguseal submission page:



Mode 1: Submit a query protein

In Mode 1 steps 1 - 4 of the Mustguseal protocol are executed. This mode will automatically collect and align all structures and sequences of proteins homologous to your query and produce the final structure-guided sequence alignment. Users are advised to use this mode as the default.

Guideline for the query selection. In Mode 1 you submit the PDB and chain IDs of a query protein. Select the query protein based on your particular task and primary interest. It can be the target protein selected for further experimental design, the most studied member of the superfamily, or the protein you are most familiar with. If the structure of the selected protein is made up of identical subunits (e.g., a homotetramer), then the choice of the chain is arbitrary. If the structure is made up of different subunits (e.g., a heterodimer), you should start by submitting the ID of the chain which is primary to protein function (e.g., the one that contains the catalytic residues and the active site of an enzyme). Generally speaking, if the structure is made up of different subunits, you should build a separate alignment for each chain. Please note that the chain ID is case sensitive (i.e., 'A' and 'a' will be considered as different inputs).

Note for incomplete PDB structures. Incomplete structures are a common case in the PDB database. The most prominent problems are modified/non-standard residues annotated as heteroatoms and missing flexible loops. When running in Mode 1, Mustguseal collects and aligns protein structures as they are. In particular, all heteroatoms are excluded from the structural alignment. You should check whether the structures which were used to build the core structural alignment in Mode 1 are incomplete (you can download the respective data at the Results page) and, if they are, evaluate the importance of the missing/modified regions for your study. You could build complete models of your protein structures using molecular modeling software (e.g., Modeller), align them locally on your computer or supercomputer, and then submit this user-defined core structural alignment to the server in Mode 2 or Mode 3.
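One way to spot such cases is to scan the downloaded PDB files yourself. The Python sketch below is not part of Mustguseal, and the function name scan_pdb_chain is hypothetical. Assuming standard fixed-column PDB records, it lists the residue names stored as HETATM records for one chain (these include ligands as well as modified residues; water is skipped) and reports jumps in residue numbering. Such jumps only hint at missing loops, because numbering in PDB files is not always continuous, and insertion codes are ignored here.

```python
def scan_pdb_chain(pdb_text, chain_id):
    """Return (HETATM residue names, numbering gaps) for one chain.

    Heuristic sketch: only the first model is read, insertion codes are
    ignored, and a numbering gap is a hint, not proof, of missing residues.
    """
    hetero = set()
    numbers = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "ENDMDL":
            break  # stop after the first model
        if record not in ("ATOM", "HETATM") or len(line) < 26:
            continue
        if line[21] != chain_id:  # chain IDs are case sensitive
            continue
        res_name = line[17:20].strip()
        try:
            res_seq = int(line[22:26])
        except ValueError:
            continue  # skip records with an unreadable residue number
        if record == "HETATM":
            if res_name != "HOH":  # ignore water
                hetero.add(res_name)
        elif not numbers or numbers[-1] != res_seq:
            numbers.append(res_seq)
    # A jump larger than 1 between consecutive ATOM residues is reported.
    gaps = [(a, b) for a, b in zip(numbers, numbers[1:]) if b - a > 1]
    return sorted(hetero), gaps
```

A call such as scan_pdb_chain(open("structure.pdb").read(), "A") (the file name is a placeholder) returns the two lists; anything reported should be judged against the importance of that region for your study.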



Mode 2: Submit a core structural alignment

This mode provides an opportunity to submit a user-defined core structural alignment to construct a structure-guided sequence alignment of the selected protein families and their closest homologs. In Mode 2 steps 3 and 4 of the Mustguseal protocol are executed.

The structural alignment of a representative set of homologous proteins is the core of the Multiple Structure-Guided Sequence Alignment. It is important that proteins in this core structural alignment represent the desired diversity among the protein families of interest for a particular research objective. The user may wish to include different structures in the core structural alignment (i.e., different from what has been selected automatically by the server in Mode 1) or to alter the alignment itself in any way (e.g., by manually editing the local superposition of amino acids).

File Format Requirements in Mode 2. The sequence representation of the core structural alignment (i.e., not the 3D coordinates but the fasta sequence file) should be submitted in Mode 2. One text file (flatfile) with the '.fasta_aln' extension has to be submitted, with sequences in the FASTA format. If the file describes only one protein, then its sequence should be without gaps. If the file describes two or more proteins, then their sequences must be aligned, i.e., the total length of the amino acid sequence plus gaps must be the same for all proteins in the file. Protein names should not exceed 100 characters in length and should not contain special characters. Protein sequences should not contain the 'Z' character, special characters, or numbers.
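These requirements can be checked locally before submission. The following Python sketch is an illustration only, not the validation performed by the server; the function names are hypothetical. It assumes that '-' is the gap symbol and that letters other than 'Z' in either case are acceptable. Protein names are checked for length only, because this page does not list which characters count as special.

```python
import string

MAX_NAME_LENGTH = 100
GAP = "-"  # assumption: '-' is the gap symbol
# Assumption: any letter except 'Z'/'z' is allowed, plus the gap symbol.
ALLOWED = (set(string.ascii_letters) - set("Zz")) | {GAP}


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_core_alignment(text):
    """Return a list of problems; an empty list means none were found."""
    records = read_fasta(text)
    if not records:
        return ["no FASTA records found"]
    problems = []
    for name, seq in records:
        if len(name) > MAX_NAME_LENGTH:
            problems.append(f"name longer than {MAX_NAME_LENGTH} characters: {name[:20]}...")
        bad = sorted(set(seq) - ALLOWED)
        if bad:
            problems.append(f"{name}: disallowed characters {bad}")
    if len(records) == 1:
        # A single protein must be submitted without gaps.
        if GAP in records[0][1]:
            problems.append("single sequence must not contain gaps")
    elif len({len(seq) for _, seq in records}) != 1:
        # Aligned sequences (residues plus gaps) must share one length.
        problems.append("aligned sequences differ in length")
    return problems
```

For example, check_core_alignment(open("core.fasta_aln").read()) (the file name is a placeholder) returns an empty list when none of the checked rules is violated.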



How to build a core structural alignment for Mode 2?

A user-defined 3D-alignment can be built from a selected set of protein structures on a local computer/supercomputer or a third party web-server and then submitted to the Mustguseal server in Mode 2. This provides an opportunity to construct a structure-guided sequence alignment of the selected protein families and their closest homologs. The general guideline for customizing the core structural alignment follows:

  • Submit a query protein in Mode 1, download the archive with the structure similarity search results, and check the superimpose.list file for the full list of structural similarities with the query. For each pairwise match the PDB annotation will be provided. The 95%-non-redundant set of protein structures will be provided in the results_nr95/ folder (see the Explanation of the Output for more details). Select representative proteins for the core structural alignment based on your particular task and primary interest. For example, if you want to compare homologous enzymes with amidase and lipase activities, then pick the structures with the respective annotation from the non-redundant set;

  • Create the structural alignment of the selected proteins on your local computer/supercomputer or on a third-party web-server. We recommend the MATT algorithm to compare remote evolutionary relatives (Menke et al., 2008). MATT searches for compatible pairs of fragments and permits structural allowances such as twists and translations, and demonstrates good performance in aligning distant relationships and length variations (Kalaimathy et al., 2011). MATT can be downloaded and installed locally or executed on the original web-server (http://matt.cs.tufts.edu/). MATT will produce the sequence representation of the core structural alignment (i.e., the '.fasta' file) which can be submitted to this server;

  • We recommend parMATT (https://biokinet.belozersky.msu.ru/parmatt) to build the core structural alignment from a large collection of protein structures. parMATT is a parallel implementation of the MATT algorithm intended for distributed-memory systems (i.e., computing clusters and supercomputers hosting memory-independent computing nodes). It can significantly accelerate the time-consuming process of building a large structural alignment. parMATT takes protein structures in the PDB format as input and produces the structural alignment in both the PDB and FASTA formats, the latter being fully compatible with the Mustguseal Mode 2 input requirements.



Mode 3: Submit a core structural alignment and results of sequence similarity search

In Mode 3 steps 3 - 4 of the Mustguseal protocol are executed. This mode provides an opportunity to submit a user-defined core structural alignment as well as sequence alignment blocks of close homologs corresponding to each representative protein in the core structural alignment. The user may alter in any way the results of the sequence similarity search obtained automatically in Mode 1 or Mode 2 (i.e., choose different proteins or change the way sequences are superimposed within each group) and then submit all building blocks of the alignment in Mode 3. Please note that the sequence representation of the core structural alignment (i.e., not the 3D coordinates but the fasta sequence file) should be submitted in Mode 3.

File Format Requirements in Mode 3. Two files have to be submitted. The first file is the core structural alignment - see the File Format Requirements in Mode 2 above. The second file is an archive with sequence alignment files. This file must have the '.tgz' extension and be a TAR+GZIP archive. To create this file use the command

tar czf upload.tgz folder_with_sequence_alignment_files

in Linux. In Windows use the free tool 7-zip to pack the sequence alignment files into a '.tar.gz' archive and then manually change the file extension to '.tgz'. The archive must contain a collection of text files (flatfiles) with the '.fasta_aln' or '.final.fasta_aln' extension and sequences in the FASTA format. The structure of subfolders within the archive (i.e., the number of folders and their names) is not restricted, provided that each text file has a unique name. The technical requirements for the content of these files are the same as for the core structural alignment (see above). The names of sequence alignment files (e.g., pdb1.fasta_aln) must correspond to the names of the representative proteins (e.g., pdb1) in the core structural alignment. Each sequence alignment file must include the respective representative protein, and its amino acid sequence must be identical to that in the core structural alignment file. Ensure that the archive contains exactly one sequence alignment file for each representative protein in the core structural alignment. Duplicate protein names within one alignment file are not allowed.
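The consistency rules for the archive can be checked, and the archive packed, on any operating system with a short script. The Python sketch below is an illustration under stated assumptions, not the check performed by the server; the function names are hypothetical. It assumes that the representative protein appears in its alignment file under the same name as in the core structural alignment, and that "identical amino acid sequence" means identical once '-' gap symbols are removed.

```python
import tarfile
from pathlib import Path

SUFFIXES = (".final.fasta_aln", ".fasta_aln")  # longest suffix first


def read_fasta(path):
    """Parse a FASTA file into a list of (name, sequence) pairs."""
    records = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif line and records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def representative_of(path):
    """Derive the representative protein name from an alignment file name."""
    for suffix in SUFFIXES:
        if path.name.endswith(suffix):
            return path.name[: -len(suffix)]
    return path.name


def check_mode3_input(core_file, folder):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    core = dict(read_fasta(core_file))
    found = {}
    # Subfolders are allowed, so search the folder recursively.
    for path in sorted(Path(folder).rglob("*.fasta_aln")):
        found.setdefault(representative_of(path), []).append(path)
    for name, core_seq in core.items():
        paths = found.get(name, [])
        if len(paths) != 1:
            problems.append(f"{name}: expected exactly one alignment file, found {len(paths)}")
            continue
        records = read_fasta(paths[0])
        names = [n for n, _ in records]
        if len(names) != len(set(names)):
            problems.append(f"{paths[0].name}: duplicate protein names")
        block = dict(records)
        if name not in block:
            problems.append(f"{paths[0].name}: representative {name} is missing")
        elif block[name].replace("-", "") != core_seq.replace("-", ""):
            problems.append(f"{name}: sequence differs from the core alignment")
    for extra in sorted(set(found) - set(core)):
        problems.append(f"{extra}: not a representative in the core alignment")
    return problems


def pack_blocks(folder, archive="upload.tgz"):
    """Write a gzip-compressed tar archive of the folder."""
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(folder), arcname=Path(folder).name)
    return archive
```

A typical use would be to call check_mode3_input("core.fasta_aln", "folder_with_sequence_alignment_files") and, if the returned list is empty, pack_blocks("folder_with_sequence_alignment_files"); the core file name here is a placeholder.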

If you have a problem preparing the input in Mode 3 you should choose a query protein and submit it in Mode 1 (or submit the core structural alignment in Mode 2), then download the core structural alignment and the corresponding sequence alignments, and try using the automatically created alignments as a template to create your new input.


Claim your results by TaskID
The results and progress log of a previously submitted task can be accessed by entering its 16-symbol TaskID in the corresponding form on the Mustguseal submission page.