visualCMAT

A web-server to select and interpret
correlated mutations/co-evolving residues in protein structures



Have a protein family? ... to predict and visualize correlated mutations/co-evolving residues in protein structures



Version 1.0; August 4th 2017


  • Submit a multiple protein alignment in FASTA format and a representative protein structure in PDB format to the visualCMAT server. You can build a large alignment of your protein families automatically using the Mustguseal server;

  • Multiple alignments of thousands of proteins can be handled by this server;

  • The visualCMAT server will automatically match your representative protein with the multiple alignment, predict correlated mutations/co-evolving residues, and create the visualCMAT annotation based on the bioinformatic, statistical, and structural analyses;

  • The CMAT algorithm [Jeong & Kim (2012) Protein Eng. Des. Sel.] is implemented to predict the correlated mutations/co-evolving residues;

  • All steps of the visualCMAT server protocol are executed entirely on the server side. You do not need any specific software on your side;

  • Results of the visualCMAT server can be downloaded to your computer for local use or studied on-line using the built-in interactive analysis tools. Interactivity is implemented in HTML5, so neither plugins nor Java are required;

  • Press the Run visualCMAT on-line button and then press the Demo mode button to request a demonstration of the visualCMAT server.





Co-evolving positions in a protein superfamily are positions where mutations in one region of the protein structure are compensated by mutations in another region; they can be used to highlight pairs of structurally and functionally important residues. Studying co-evolution in protein structures can help to understand the relationship between structure and function in proteins. Correlated positions are considered hotspots for rational design and directed evolution experiments aimed at producing mutant enzymes with improved properties, and are studied to understand the mechanism of allosteric communication between multiple ligand binding centers in protein structures.

The visualCMAT server can automatically identify correlated mutations/co-evolving positions in a multiple alignment of a protein superfamily and annotate the structures of representative proteins according to the bioinformatic and statistical analysis of the predicted correlations and the structural analysis of the potential binding sites in the protein structures. Visual representation of the correlated mutations can help to select the most important positions in protein structures, interpret their implications for protein function and regulation, and study the structure-function relationship in protein superfamilies. The visualCMAT is available as a fully automated web-server and as a Perl script front-end to the popular CMAT tool.




Build large alignments of protein families automatically with the Mustguseal server

You can use the Mustguseal server to automatically construct large multiple alignments of protein families from all available information about their structures and sequences in public databases. Multiple alignments of thousands of protein sequences and structures can be automatically constructed using that public web-server.




Prerequisites for using visualCMAT

The visualCMAT web-server. All steps of the visualCMAT server protocol are executed entirely on the server side. You do not need any specific software on your side. The Analysis section of the Results page offers interactive content for analysis of the visualCMAT annotation. Interactivity is implemented in HTML5, so neither plugins nor Java are required. The only prerequisite for viewing the Analysis section of the Results page is an HTML5-compatible web browser. Current versions of all major browsers support HTML5 as the default standard. The server has been successfully tested with Google Chrome, Mozilla Firefox, KDE Konqueror, and Microsoft Edge. Please note that we have had compatibility issues with Microsoft Internet Explorer, and this browser is therefore currently not supported by the visualCMAT web-server.

The local use of the visualCMAT. The visualCMAT tool is a Perl script that takes the CMAT output file and a representative PDB structure as input. The visualCMAT invokes third-party programs during its execution. You will have to meet the following prerequisites in order to run visualCMAT on your local computer:

  • Linux Operating System: The visualCMAT was originally designed to be executed from the shell on Linux-operated machines. Linux virtual machines (e.g., operated by VirtualBox) are also OK;

  • CMAT Correlated Mutation Analysis Tool: CMAT has to be executed manually by the user prior to running visualCMAT. The CMAT analysis can be performed on a web-server or the corresponding program can be downloaded for local use. The CMAT web-server can be accessed at binfolab12.kaist.ac.kr/cmat (external link). The CMAT source code can be downloaded using the link cmat_0.3.2.tar.gz (external link). Pre-built CMAT binaries for 32-bit and 64-bit systems can be downloaded from cmat_static_32.gz (external link) and cmat_static_64.gz (external link), respectively;

  • Perl interpreter: The visualCMAT does not depend on any non-standard modules so any modern Perl distribution will do. Any Linux distribution has a Perl interpreter already installed;

  • MAFFT multiple sequence alignment program: This tool is invoked by the visualCMAT to align the representative protein sequence from the CMAT output file to the sequence of the protein in the PDB file. The source code can be downloaded from mafft.cbrc.jp/alignment/software (external link);

  • FPocket protein cavity detection algorithm: This tool is invoked by the visualCMAT to predict binding sites at the surface of protein structure. The source code of Fpocket v.2.0 can be downloaded from fpocket.sourceforge.net (external link);

  • PyMol molecular graphics engine: This tool is invoked by the visualCMAT to compile the PSE session with structure-based annotation of correlated mutations (CMs). PyMol can be installed from a Linux repository (e.g., zypper in pymol in openSuSE), compiled from sources, or purchased from the distributor Schrödinger LLC (external link).




Preparing the input data

  • The multiple sequence alignment should represent the desired diversity among the protein families of interest. You can use the Mustguseal server to automatically construct large multiple alignments of protein families from all available information about their structures and sequences in public databases - all you need is to submit a single query protein as a PDB ID. Multiple alignments of thousands of protein sequences and structures can be automatically constructed using that public web-server. A multiple alignment created by the Mustguseal can be automatically uploaded to the visualCMAT server in one click;

  • The representative protein structure is expected to correspond to a protein in the sequence alignment. You should always aim at submitting a representative PDB structure that corresponds to a protein in the alignment with at least 95% pairwise sequence identity. If no structure is available for the proteins in your alignment, you can use the structure of a close homolog from the PDB database or build a 3D model of the representative protein from the available structural data using homology modeling (e.g., with the help of the Modeller software).
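The 95% identity recommendation above can be checked before submission. The following is a minimal sketch that computes percent identity between two already aligned sequences. How the visualCMAT server itself measures identity is not stated here, so the definition used (identical residues divided by the number of columns where both sequences have a residue) and the function name are assumptions.

```python
def pairwise_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where both sequences have a residue.

    Assumption: columns with a gap ('-') in either sequence are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Hypothetical aligned fragments: PDB-derived sequence vs. alignment row.
    print(pairwise_identity("ACD-EF", "ACDGEY"))
```

A value below 95 for the best-matching alignment row suggests choosing another structure or building a homology model.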




Running CMAT locally
Prior to running CMAT you should select the representative protein from the list of proteins in your multiple sequence alignment. Select the representative protein based on your particular task and primary interest. It can be the target protein selected for further experimental design, the most studied member of the superfamily, or the protein you are most familiar with. Importantly, the representative protein must have a 3D model available, either in the PDB or predicted, e.g., by homology modeling.

CMAT supports various formats of a multiple sequence alignment. We recommend the FASTA format as the most convenient for manual editing.

When running CMAT/visualCMAT locally ensure that the selected representative protein is the first in the multiple sequence alignment (MSA) file. You can edit the MSA file in a text editor (e.g., Kate) to move the corresponding sequence to the top of the file. To check the protein order in the FASTA formatted MSA file execute a simple command in Linux shell:

cat MSA.fasta | grep '>' | nl

The output would look like this:


1 >0_1p38_A
2 >870_2xkd_A
3 >981_2g15_A
4 >P23049
5 >900_4uzh_A
6 >Q75LR7
...


The selected representative protein must be the first in this list.
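As an alternative to editing the file by hand, the reordering can be scripted. This is a minimal sketch, not part of visualCMAT: the function names are hypothetical, and it assumes the sequence ID is the first whitespace-delimited token of each FASTA header.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def move_to_top(records, representative_id):
    """Return a new record list with the representative protein first."""
    for index, (header, _) in enumerate(records):
        # Compare against the first token of the header (assumption).
        if header.split()[:1] == [representative_id]:
            return [records[index]] + records[:index] + records[index + 1:]
    raise KeyError("sequence not found: %s" % representative_id)


def write_fasta(records, width=60):
    """Format records back to FASTA text, wrapping sequences at `width`."""
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("MSA.fasta") as handle:
        records = read_fasta(handle.read())
    with open("MSA_reordered.fasta", "w") as handle:
        handle.write(write_fasta(move_to_top(records, "0_1p38_A")))
```

The output file name MSA_reordered.fasta is only an example; the order of all other sequences is preserved.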

Now execute the CMAT routine:

./cmat_static_64 MSA.fasta -o MSA.cmat -v 100 -b 100 -p 100

The CMAT analysis will take some time depending on the size of the alignment and the speed of your CPU. On successful completion the MSA.cmat file will become available. The amino acid numbering in the CMAT output file follows the representative protein (the first in the MSA file). The sequence of the representative protein is printed at the top of the CMAT output file.




Running visualCMAT locally
To run visualCMAT you need a CMAT output file and a PDB structure file of the representative protein which was used to run CMAT (see above). When this data is available and all software prerequisites are met you are ready to execute visualCMAT in the Linux shell:


./visualCMAT2.pl predictions.cmat representative.pdb output_prefix <"zp" or "zc">

The visualCMAT analysis includes the following steps:

  • Comparison of the representative protein amino acid sequence (printed at the top of the CMAT output file) with the sequence derived from the CA coordinates of the representative PDB file. These sequences should ideally match; however, gaps are allowed in the corresponding pairwise alignment, which is performed by the MAFFT program. When running CMAT/visualCMAT locally, mismatches are not allowed and will cause the visualCMAT to terminate;

  • The CMAT output file will be processed. The CMAT algorithm implements two independent scoring functions to estimate the residue correlation - the MIc / Zc and MIp / Zp statistics. The visualCMAT can use either the c- or p-statistics to prepare the graphical output. The predictions created with the two settings are usually equivalent;

  • The FPocket protein cavity detection algorithm will be invoked by the visualCMAT to predict binding sites at the surface of the representative protein structure. Seven iterations of FPocket will be performed with the crucial threshold of Minimum number of a-sphere per pocket taking the values 10, 20, 25, 27, 30, 40, and 50 (small values favor smaller pockets, large values favor larger pockets);

  • The PyMol instruction script with structure-based annotation of the representative protein PDB will be created and then PyMol will be automatically invoked by visualCMAT to execute this script and save the results as a binary PSE session.
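The first step in the list above relies on a sequence derived from the CA coordinates of the PDB file. The sketch below shows one way to obtain such a sequence for a manual check before running the script. It is an illustration, not the code used by visualCMAT: the function name is hypothetical, only the 20 standard residues are mapped (anything else becomes 'X'), and only the first model is read.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def ca_sequence(pdb_text, chain=None):
    """Return the one-letter sequence read from CA atoms of ATOM records."""
    residues = []
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # use only the first model of a multi-model file
        # ATOM records only, so calcium ions (HETATM "CA") are ignored.
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        chain_id = line[21:22]
        if chain is not None and chain_id != chain:
            continue
        key = (chain_id, line[22:27])  # residue number + insertion code
        if key in seen:
            continue  # skip alternate locations of the same residue
        seen.add(key)
        residues.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(residues)


if __name__ == "__main__":
    with open("representative.pdb") as handle:
        print(ca_sequence(handle.read(), chain="A"))
```

The printed sequence can be compared with the sequence at the top of the CMAT output file to spot mismatches early.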




The visualCMAT output


The visualCMAT server will provide the following set of output files upon successful completion of the task processing:

  • The VisualCMAT annotation file is a PyMol 'PSE' session file which contains the representative protein structure annotated according to the bioinformatic, statistical and structural analyses of the predicted correlated mutations/co-evolving residues;

  • The CMAT list of correlated pairs is a plain text file listing all predicted correlated positions. You may want to see this file to learn the occurrence of amino acids in each pair of positions. Please see the CMAT help for a detailed description of the CMAT output file format;

  • The Correlated positions ID table is a plain text file which provides for each position its ID in the representative protein structure and its ID in the sequence, which corresponds to that representative protein in the multiple sequence alignment. This file will be required to compare the visualCMAT graphical representation of the correlated positions and the CMAT output file with amino acid occurrence statistics;

  • The VisualCMAT PyMol script pack contains all information required to recompile the visualCMAT annotation 'PSE' file. The files are packed in a 'tar.gz' archive. To extract files from a 'tar.gz' archive use the command tar xzf visualcmat_TaskID.tar.gz in Linux or the free 7-Zip tool in Windows. To recompile the 'PSE' session run the command pymol -c visualcmat/visualcmat_TaskID_pymol.py in Linux or use the "File" → "Run Script" interface menu in Windows. When running the script it is important to maintain the relative paths (i.e., the representative protein structure file will be extracted to the current folder ./ and the other files to the visualcmat folder, and all these files must be accessible by these paths when launching the script).

When running the visualCMAT manually as a standalone script the following set of output files will be created in the user-defined output folder upon successful completion of the task processing:

  • The VisualCMAT annotation file is a PyMol 'PSE' session file which contains the representative protein structure annotated according to the bioinformatic, statistical and structural analyses of the predicted correlated mutations/co-evolving residues;

  • The Correlated positions ID table is a plain text file which provides for each position its ID in the representative protein structure and its ID in the sequence, which corresponds to that representative protein in the multiple sequence alignment. This file will be required to compare the visualCMAT graphical representation of the correlated positions and the CMAT output file with amino acid occurrence statistics;

  • The VisualCMAT PyMol script is a Python file required to recompile the visualCMAT annotation 'PSE' file. To recompile the 'PSE' session run the command pymol -c output/output_pymol.py in Linux or use the "File" → "Run Script" interface menu in Windows. When running the script it is important to maintain the relative paths (i.e., the representative protein structure file must be in the current folder ./ and the other files in the ./output folder).


The primary output of the visualCMAT is the annotation of the representative protein structure according to the bioinformatic, statistical and structural analyses of the predicted correlated mutations/co-evolving residues, packed in a PyMol binary 'PSE' file. The PSE file can be opened by the PyMol molecular graphics engine. The advantage of the PSE format is that advanced structure annotation can be easily saved, stored and transferred. The only significant disadvantage is that the PSE format is not backwards compatible (e.g., a PSE file created by PyMol version 1.8.x might not open in PyMol version 1.6.x, but you could still try). There could be some minor compatibility issues even within the same minor version (e.g., within the 1.8.x version). Nevertheless, the PSE standard is a very useful tool. To open the 'PSE' session file compiled by the visualCMAT server you will need PyMol v. 1.7.3.0 or higher.

The PyMol PSE file contains a multi-layered annotation, i.e., several layers with different information can be turned on and off by the user (see an example below). Each layer can be studied independently or in a combination with other layers to help the user in interpreting the structure-function relationship of the predicted correlations in protein structure.

The visualCMAT calculates the sum of Z-scores for each position $i$ as $Z_i = \sum_j Z_{i,j}$, where $Z_{i,j}$ is either the Zc or the Zp statistic for the predicted pair of correlated residues $i$ and $j$. The statistic to be used (c or p) is defined in the visualCMAT input settings.
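The per-position sum can be illustrated with a short sketch. It assumes the predicted pairs are available as (i, j, Z) tuples with each unordered pair listed once; this input layout, the function name, and the example values are assumptions for illustration and do not reflect the CMAT output file format.

```python
from collections import defaultdict


def cumulative_z(pairs):
    """Sum the pair Z-scores over all partners j for every position i."""
    totals = defaultdict(float)
    for i, j, z in pairs:
        # Each pair contributes to both of its positions.
        totals[i] += z
        totals[j] += z
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical predictions: (position i, position j, Z-score).
    pairs = [(35, 53, 12.0), (35, 104, 8.5), (53, 104, 4.0)]
    for position, total in sorted(cumulative_z(pairs).items()):
        print(position, total)
```

A position with many moderate correlations can therefore reach the same sum as a position with a single very strong one.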

The annotation layers provided by the visualCMAT are discussed in more detail below.

Layer 1: Gradient paint of amino acids according to the best correlation


Each position in the representative protein structure is gradient-painted according to the predicted correlation Z-scores with other positions in the structure. If a position participates in more than one correlation with other positions, the largest Z-score is used. Z-scores are CMAT measures of the statistical significance of the predicted correlations, with larger values indicating stronger correlations (painted in intense red). This layer of information is useful for studying individual residues whose change throughout evolution in the structures of homologous proteins is strongly correlated with other changes in these structures.
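The "largest Z-score per position" rule can be sketched as follows. The input layout of (i, j, Z) tuples and both function names are assumptions. The linear scaling to a 0..1 red intensity is also an assumption made for illustration; the exact color mapping used by the visualCMAT is not described here.

```python
def best_z(pairs):
    """Keep the largest pair Z-score seen for every position."""
    best = {}
    for i, j, z in pairs:
        for position in (i, j):
            if position not in best or z > best[position]:
                best[position] = z
    return best


def red_intensity(best):
    """Scale best Z-scores linearly to 0..1 (assumed mapping, 1 = most red)."""
    if not best:
        return {}
    low, high = min(best.values()), max(best.values())
    span = high - low
    return {
        position: (z - low) / span if span else 1.0
        for position, z in best.items()
    }


if __name__ == "__main__":
    # Hypothetical predictions: (position i, position j, Z-score).
    pairs = [(35, 53, 12.0), (35, 104, 8.5), (53, 104, 4.0)]
    print(red_intensity(best_z(pairs)))
```

Unlike the cumulative score, this value reflects only the single strongest correlation of a position.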


Layer 2: Pairs of correlated amino acid residues


Each pair of predicted correlated positions is connected by a dashed line in the structure of the representative protein. The gradient-paint of a dashed line between two positions is proportional to the correlation Z-score for this pair of positions (with intense red indicating stronger correlation). This layer of information is useful for studying the pairs and clusters of correlated positions.


Layer 3: Annotation of positions according to the cumulative degree of correlation


The CA-atom of each position in the representative protein structure is shown as a sphere whose radius is proportional to a sum of Z-scores of all predicted correlations of this position with other positions in the structure. Larger spheres indicate positions which tend to participate in a larger number of correlations with other positions, or tend to participate in stronger correlations with other positions, or both. This layer of information is useful to study the pairs and clusters of correlated positions with a focus on the most significant individual residues involved in these correlations.


Layer 4 (sub layers 4.1-4.7): Annotation of potential binding sites in the representative protein structure

Seven sub layers provide information about the potential binding sites predicted by the FPocket algorithm with different settings - i.e., the crucial Minimum number of a-sphere per pocket parameter taking values in a range from 10 (small pockets are preferred) to 50 (large pockets are preferred). The 'fpocket_50' and 'fpocket_40' sub layers show only the largest pockets and cavities on the protein surface. In contrast, the 'fpocket_10' and 'fpocket_20' sub layers show only small pockets, and in particular tend to display multiple subpockets of larger pockets as individual binding sites. The middle sub layers 'fpocket_25', 'fpocket_27', and 'fpocket_30' provide a balanced prediction of the potential binding sites.
When using the visualCMAT server only the 'fpocket_30' sub layer can be viewed by the interactive tool on the Analysis page. The complete visualCMAT annotation can be downloaded as the 'PSE' session file on the Results page.
This layer of information is useful to study distant communication between topologically independent sites in proteins by mapping the binding sites and the correlated positions on the same protein structure. See below for more information.
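The seven FPocket runs behind these sub layers can be reproduced manually. The sketch below only builds the command lines and does not run them. The threshold values are taken from the sub layer names above. The fpocket options used (-f for the input PDB file and -i for the minimum number of alpha spheres per pocket) are stated as an assumption based on the Fpocket 2.0 usage message and should be verified against your local installation; the function name is hypothetical.

```python
# Threshold values taken from the sub layer names fpocket_10 ... fpocket_50.
MIN_ALPHA_SPHERES = (10, 20, 25, 27, 30, 40, 50)


def fpocket_commands(pdb_path, thresholds=MIN_ALPHA_SPHERES):
    """Map each sub layer name to the argument list of one fpocket run.

    Assumption: '-f' selects the PDB file and '-i' sets the minimum
    number of alpha spheres per pocket (check 'fpocket' help output).
    """
    return {
        "fpocket_%d" % n: ["fpocket", "-f", pdb_path, "-i", str(n)]
        for n in thresholds
    }


if __name__ == "__main__":
    for layer, command in fpocket_commands("representative.pdb").items():
        print(layer, " ".join(command))
```

Each argument list can be passed to subprocess.run once the options have been confirmed.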




Guidelines on working with the visualCMAT output
The primary visualCMAT output is a content-rich file in the PyMol PSE format. The file contains the structure of the representative protein and multiple layers of annotation which correspond to the bioinformatic, statistical and structural analyses of the predicted correlated mutations. Each annotation layer is provided as one or more separate objects in the PyMol PSE session. The user can benefit from the multi-layered annotation by selecting and combining different types of information for a particular purpose. Layers with different content can be turned on and off in the PyMol viewer. This feature can be used to study the annotation layers independently, and also provides an opportunity to combine the selected layers to create new information content for expert analysis.

For beginners we recommend the following basic guide to working with the visualCMAT output:

  • Scroll down the object menu (the panel to the right in the PyMol viewer, which lists the individual objects read from the file, each object as a horizontal grey bar) and disable the presentation of the binding sites (a left click on the "fpocket_10", "fpocket_20", ..., "fpocket_50" objects will change their color to a darker shade of grey and remove the corresponding content from the graphical viewer);

  • If you are interested in particular amino acid residues (e.g., residues which were shown by experimental site-directed mutagenesis to have an impact on protein function, activity or stability), then start by visualizing only the correlated pairs which are formed by these residues - disable all objects of the type "PAIR" (PAIR[ID]__[RES1]-[RES2]__[Z-score]) which correspond to pairs of residues that are not of interest to you. You can now see how the selected residues correlate with other positions in the structure and how strong these correlations are (i.e., by the intensity of the red gradient paint of the backbone and the dashed lines, as well as the size of the spheres);

  • If you do not have any particular residues in focus, make sure that the "res_scores" object is enabled and then pay attention to the residues marked by the largest spheres. Switch off all "PAIR" objects involving other residues. You can now see the network of the most significant correlations in your protein family/superfamily;

  • Evaluate the most statistically significant pairwise correlations - enable only the first few "PAIR" objects with the largest Z-scores (PAIR[ID]__[RES1]-[RES2]__[Z-score]);

  • To learn the co-occurrence of particular amino acid types at the selected positions use the "Correlated positions ID table" file from the supplementary output. For any position in the representative protein structure this table will provide its ID in the corresponding protein sequence. This SeqID can be used to browse the "CMAT list of correlated pairs" for detailed information about the amino acid content at the selected position(s);

  • Study the possible functional and regulatory importance of the most significant selected correlations by enabling an annotation layer with the predicted binding sites. Enable only one binding site layer at a time (e.g., "fpocket_30"). The "fpocket_30" layer usually provides a balanced view of the potential binding sites and cavities on the protein surface, but can miss smaller sites or merge two independent but closely located sites into one larger pocket. You should also try the "fpocket_27", ..., "fpocket_10" layers for a potentially more information-rich annotation of the binding sites. You can now see whether the selected correlations could be involved in long-range communication between topologically independent binding sites in your protein structure.
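Selecting the "PAIR" objects with the largest Z-scores, as suggested in the guide above, can also be scripted. The sketch below parses object names that follow the PAIR[ID]__[RES1]-[RES2]__[Z-score] pattern and returns the top-scoring ones. The concrete residue labels and the example names are hypothetical, and the regular expression is an assumption about how the placeholders are filled in; adjust it to the names in your own PSE session.

```python
import re

# Assumed layout: PAIR<ID>__<RES1>-<RES2>__<Z-score>
PAIR_NAME = re.compile(
    r"^PAIR(?P<id>[^_]+)__(?P<res1>.+?)-(?P<res2>.+?)__(?P<z>-?\d+(?:\.\d+)?)$"
)


def top_pairs(object_names, n=5):
    """Return the names of the n PAIR objects with the largest Z-scores."""
    scored = []
    for name in object_names:
        match = PAIR_NAME.match(name)
        if match:  # silently skip non-PAIR objects such as "res_scores"
            scored.append((float(match.group("z")), name))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scored[:n]]


if __name__ == "__main__":
    # Hypothetical object names for illustration only.
    names = [
        "PAIR1__35-53__12.4",
        "PAIR2__35-104__8.5",
        "res_scores",
        "PAIR3__53-104__15.0",
    ]
    print(top_pairs(names, n=2))
```

Inside PyMol, the returned names could then be passed to cmd.enable, and all remaining PAIR objects to cmd.disable.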




The visualCMAT example
The visualCMAT server has a Demo mode available. To request the demonstration press the "Demo mode" button at the submission page and then press "Submit".

Use the following input data to test the visualCMAT tool locally:

  • Multiple sequence alignment file of Mitogen-Activated Protein Kinases: MAPK.fasta
  • The CMAT output (the first protein is 0_1p38_A): MAPK.cmat
  • The structure of the representative protein Human alpha-MAPK (PDB: 1P38): 0_1p38_A.pdb

The MAPK.fasta alignment file was used to run CMAT to produce the MAPK.cmat file. The MAPK.fasta alignment file is not needed to run visualCMAT and is provided for information purposes only. Run visualCMAT in the Linux shell:
./visualCMAT2.pl MAPK.cmat 0_1p38_A.pdb MAPK_visualcmat zc

The visualCMAT output: MAPK_visualcmat.tar.gz




Implementation of visualCMAT in the laboratory practice
Co-evolving positions in a protein family are positions where the occurrence of mutations in one region is compensated by mutations in another region, e.g., to maintain energetically favorable interactions; they can be used to highlight pairs of structurally important residues. Co-evolving positions do not necessarily correspond to amino acids which are in close structural proximity and may highlight more complex biological constraints arising from evolutionary adaptation. Studying co-evolution in protein structures can help to understand the relationship between structure and function in protein superfamilies. Correlated positions are considered hotspots for rational design and directed evolution experiments aimed at producing mutant enzymes with improved properties, and are studied to understand the mechanism of allosteric regulation in proteins with multiple ligand binding centers.

See the following publications for more information on studying co-evolving/correlated positions and their role in proteins:

Suplatov, D., Kirilin, E., & Švedas, V. (2016). Bioinformatic Analysis of Protein Families to Select Function-Related Variable Positions. In Understanding Enzymes: Function, Design, Engineering, and Analysis (pp. 351-385) Ed. Allan Svendsen. Pan Stanford.

Suplatov, D., Voevodin, V., & Švedas, V. (2015). Robust enzyme design: Bioinformatic tools for improved protein stability. Biotechnology Journal, 10(3), 344-355.

Suplatov, D., & Švedas, V. (2015). Study of functional and allosteric sites in protein superfamilies. Acta Naturae, 7(4), 34-45.




Download visualCMAT
visualCMAT v. 0.98 [2017-09-15] download




Citing visualCMAT

If you find visualCMAT or its results useful please cite our work:

Suplatov D.A., Sharapova Ya.A., Kopylov K.E., Švedas V. (2017) The visualCMAT tool to select and interpret correlated mutations/co-evolving residues in protein structures. https://biokinet.belozersky.msu.ru/visualcmat

Please also cite the work of third-party contributors, whose programs and algorithms are currently implemented in the visualCMAT web-server:

The CMAT algorithm is used to predict the correlated mutations/co-evolving residues:
Jeong, C. S., & Kim, D. (2012). Reliable and robust detection of coevolving protein residues. Protein Engineering Design and Selection, 25(11), 705-713.

The MAFFT program is used to compare the sequence of the representative PDB with the multiple sequence alignment:
Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30(4), 772-780.

The FPocket program is used to predict binding sites in the representative protein structure:
Schmidtke, P., Le Guilloux, V., Maupetit, J., & Tuffery, P. (2010). Fpocket: online tools for protein ensemble pocket detection and tracking. Nucleic Acids Research, 38(suppl_2), W582-W589.

The PyMol program is used to compile the binary PSE session with the annotated representative structure:
PyMol: A molecular visualization system. http://pymol.org/, Copyright (C) Schrödinger, LLC




Contacts and support

Yana Sharapova



