Preparing the input data for visualCMAT analysis


 

The input to visualCMAT consists of (1) a multiple protein alignment in FASTA format and (2) a representative protein structure in PDB format.

The input data for visualCMAT can be prepared automatically by the sister web-server Mustguseal. Mustguseal constructs large structure-guided sequence alignments of functionally diverse protein families that include thousands of proteins, based on all available information about their structures and sequences in public databases. The resulting alignment can then be submitted to the visualCMAT web-server in one click.

Run the Mustguseal web-server with a pre-selected set of parameters to automatically construct a multiple alignment of your protein families for further analysis by visualCMAT.

Then, on the Mustguseal Results page, scroll down and use the Submit to visualCMAT button to automatically upload the Mustguseal alignment to the visualCMAT web-server.

Alternatively, you can prepare the input manually. You have to submit two files to the visualCMAT server: a multiple protein alignment in FASTA format and a representative protein structure in PDB format. To select a protein chain for the bioinformatic analysis, use the "Select a protein chain" field: type in a particular chain ID (case-sensitive) or leave it set to "first" to select the first chain that appears in the PDB file. The representative protein structure should match one protein in the sequence alignment. Set the "Set the ID of the representative protein in the multiple alignment" field to "auto" to let the server match the PDB structure against the alignment and select the sequence with the best pairwise match, or enter the number corresponding to the position of the representative protein in the multiple sequence alignment file (numbering starts from "1").
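To see in advance which chain the "first" setting would pick, you can list the chain IDs of your PDB file locally. The sketch below is only an illustration and not the server's code: the function names list_chains and select_chain are hypothetical, and the only fact it relies on is that the PDB format stores the chain identifier in column 22 of ATOM records.

```python
def list_chains(pdb_text):
    """Return chain IDs in order of first appearance among ATOM records."""
    chains = []
    for line in pdb_text.splitlines():
        # In the PDB format the chain identifier is column 22 (index 21).
        if line.startswith("ATOM") and len(line) > 21:
            chain_id = line[21]
            if chain_id not in chains:
                chains.append(chain_id)
    return chains


def select_chain(pdb_text, choice="first"):
    """Mimic the "Select a protein chain" field (hypothetical helper).

    "first" returns the first chain found in the file; any other value
    is compared case-sensitively against the chain IDs in the file.
    """
    chains = list_chains(pdb_text)
    if not chains:
        raise ValueError("no ATOM records found in the PDB text")
    if choice == "first":
        return chains[0]
    if choice not in chains:
        raise ValueError(f"chain {choice!r} not found; available: {chains}")
    return choice
```

For example, select_chain(open("protein.pdb").read()) returns the chain that "first" would refer to, where protein.pdb is a placeholder file name. Because the comparison is case-sensitive, "a" and "A" are treated as different chains.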

 

General guidelines for manual preparation of the input files

The multiple sequence alignment should represent the desired diversity among the protein families of interest (i.e., contain functionally diverse homologous proteins selected for a particular task). The representative protein structure is expected to correspond to a protein in the sequence alignment. Choose the representative protein based on your particular task and primary interest. It can be the target protein selected for further experimental design, the most studied member of the superfamily, or the protein you are most familiar with. You should always aim at submitting a representative PDB structure that corresponds to a protein in the alignment with at least 95% pairwise sequence identity. If structural information is not available for all proteins in your alignment, you could use the structure of a very close homolog from the PDB database or build a 3D model of the representative protein from the available structural data using homology modeling (e.g., with the help of the Modeller software).

The visualCMAT annotation includes binding site prediction. If your protein has multiple chains (e.g., A and B) and your multiple alignment represents only a single chain (e.g., A), you should still submit the complete protein structure for adequate prediction of pockets and cavities.

 

General requirements for the input files

The input multiple alignment:

  • should contain protein amino acid sequences;
  • should contain at least six proteins;
  • should be an alignment (i.e., not just sequences, but aligned sequences, i.e., "sequences with gaps");
  • the "-" character should be used for a gap;
  • special characters in the protein names are not allowed and will be automatically replaced with "_";
  • very long protein names will be automatically truncated to the first 100 characters;
  • special characters in the protein sequences are not allowed and will be automatically replaced with gaps;
  • should be in the FASTA format (not ClustalW, not Phylip, etc.). If you do not know the format of your alignment, submit it to visualCMAT and you'll find out. If your alignment is in the wrong format, use a sequence format converter, e.g., sequenceconversion.bugaco.com.
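The alignment requirements above can be checked locally before submission. The following is a minimal sketch, not the server's preprocessing script: the function names are hypothetical, and the definition of "special characters" (anything other than letters, digits and "_" in names; anything other than letters and "-" in sequences) is an assumption, since the exact rules are not stated here.

```python
import re

MIN_SEQUENCES = 6      # "should contain at least six proteins"
MAX_NAME_LENGTH = 100  # long names are truncated to the first 100 characters


def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is None:
            raise ValueError("not FASTA: sequence data before the first '>' header")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def clean_record(name, sequence):
    """Apply the substitutions described above (character sets are assumed)."""
    name = re.sub(r"[^A-Za-z0-9_]", "_", name)[:MAX_NAME_LENGTH]
    sequence = re.sub(r"[^A-Za-z-]", "-", sequence)
    return name, sequence


def check_alignment(records):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if len(records) < MIN_SEQUENCES:
        problems.append(f"only {len(records)} sequences, at least {MIN_SEQUENCES} required")
    if len({len(seq) for _, seq in records}) > 1:
        problems.append("sequences differ in length, so this is not an alignment")
    return problems
```

A typical use would be check_alignment(parse_fasta(open("alignment.fasta").read())), where alignment.fasta is a placeholder file name. Equal sequence length is a necessary condition for an alignment but does not prove that the sequences were actually aligned.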


The input protein structure:

  • should contain the coordinates of the amino acid atoms of one protein;
  • should be in the PDB format;
  • should correspond to (i.e., ideally should be 100% identical to) one protein sequence in the multiple alignment;
  • need not be 100% identical to any protein sequence in the multiple alignment. The preprocessing script will automatically select the representative sequence from the multiple alignment by the best pairwise match between your PDB structure and any sequence in the alignment. All inconsistencies between the representative protein structure and sequence will be removed. You should always aim at submitting a representative PDB structure that corresponds to a protein in the alignment with at least 95% pairwise sequence identity. You will be allowed to proceed with as little as 50% sequence similarity between the two; however, this may cause errors during the bioinformatic analysis;
  • should contain all chains of the biological unit (e.g., A, B, C) even if the multiple alignment contains the sequences of only one chain (e.g., A);
  • may contain heteroatoms. All non-protein atoms (e.g., of a substrate) should have the HETATM prefix in the PDB file. Non-canonical amino acids will be automatically changed to their canonical equivalents (e.g., SME/MSE to MET). Ligands, cofactors, solvent and other entities will not be used for the bioinformatic analysis but will be used to prepare the graphical output and can help with interpreting the functional and regulatory significance of the predicted co-evolving positions.
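To estimate in advance how well your structure matches the alignment, you can extract the chain sequence from the PDB file and compare it with every ungapped alignment sequence. The sketch below is a rough local approximation under stated assumptions: all function names are hypothetical, the identity score uses Python's difflib as a crude stand-in for a real pairwise alignment (the server's actual matching procedure is not described here), and the 50% value is treated as a lower limit.

```python
from difflib import SequenceMatcher

# Three-letter to one-letter codes; SME and MSE are mapped to MET as described above.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
    "MSE": "M", "SME": "M",
}
RECOMMENDED_IDENTITY = 95.0  # "at least 95% pairwise sequence identity"
MINIMUM_IDENTITY = 50.0      # assumed lower limit for proceeding


def chain_sequence(pdb_text, chain_id):
    """One-letter sequence of a chain, read from ATOM/HETATM records."""
    sequence, seen = [], set()
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 26:
            continue
        if line[21] != chain_id:
            continue
        res_name = line[17:20]
        res_key = line[22:27]  # residue number plus insertion code
        if res_name in THREE_TO_ONE and res_key not in seen:
            seen.add(res_key)
            sequence.append(THREE_TO_ONE[res_name])
    return "".join(sequence)


def percent_identity(a, b):
    """Crude estimate: residues in common blocks divided by the shorter length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / min(len(a), len(b))


def pick_representative(structure_seq, alignment):
    """Return (1-based index, identity) of the best-matching alignment sequence."""
    best_index, best_identity = 0, -1.0
    for index, (_, gapped) in enumerate(alignment, start=1):
        identity = percent_identity(structure_seq, gapped.replace("-", ""))
        if identity > best_identity:
            best_index, best_identity = index, identity
    return best_index, best_identity


def verdict(identity):
    """Classify an identity value against the two thresholds above."""
    if identity >= RECOMMENDED_IDENTITY:
        return "ok"
    return "warning" if identity >= MINIMUM_IDENTITY else "rejected"
```

The 1-based index returned by pick_representative corresponds to the number you would otherwise type into the "Set the ID of the representative protein in the multiple alignment" field. Because the score divides by the shorter sequence, residues missing from the structure (e.g., disordered loops) do not lower it; a proper pairwise alignment may give somewhat different values.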