Guide to Zebra input

Q: How does Zebra work?

A: Multiple sequence alignment and, optionally, structural information about the protein superfamily are used as an input. The algorithm does not require pre-defined subfamilies and can propose multiple classifications automatically by graph based clustering at different fragmentation levels. Random shuffling and Bernoulli statistics are applied to rank hits by decreased significance and select highly valuable SSPs for further evaluation. Zebra results are provided in two ways – as a single all-in-one parsable text file and PyMol sessions with structural representation of SSPs.

Q: I want to understand how the server works.

A: Please consult our paper. We discuss the potential of Zebra to be used for understanding the structure–function relationship in proteins and provide practical guidelines to perform the bioinformatic analysis.
Suplatov D., Kirilin E., Takhaveev V., Švedas V. (2014). Zebra: web-server for bioinformatic analysis of diverse protein families, J.Biomol.Struct.Dyn., PMID:24028489.

Q: What input data is required to run Zebra?

A: Multiple sequence alignment (MSA) and, optionally, structural information about the protein superfamily are used as an input. In other words, a multiple sequence alignment is the only mandatory requirement for the analysis. See more details below for Zebra run modes.

Q: What are the requirements for a multiple sequence alignment?

A: The multiple sequence alignment should represent superimposition of protein amino acid sequences and should be in fasta format. It should contain at least six protein sequences. The latter requirement comes from the fact that at least two group (subfamilies) are required to assign the subfamily-specific positions and each group (subfamily) should contain at least three sequences to attain a reasonable level of statistical significance. The "-" sign will be treated as a gap. Standard set of 20 canonical amino acids is allowed. "B", "Z", "X" residue types will be automatically substituted for "D", "E", "G", respectively, with a warning message. Other characters are not allowed.

Q: Do I need to supply a classification of sequences into functional subfamilies?

A: The algorithm does not require pre-defined subfamilies and can propose multiple classifications automatically by graph based clustering at different fragmentation levels. The proposed functional subfamily classifications are used to identify the subfamily-specific positions. Functional subfamily classifications are finally ranked by significance of SSPs they produce. Alternatively, experimentally derived functional annotation can be provided by the user in the "Manual" mode (see below for explanation of Zebra run modes).

Q: Do I need structural information to run Zebra?

A: No, structural information is not mandatory and the analysis can be implemented without any structural data available. However, in order to use the most advanced features of the bioinformatic analysis, a PDB file corresponding to one of the sequences from MSA has to be submitted to the server. Incorporation of the 3D information can significantly improve Zebra predictions.

Q: How many structures should I submit to the server?

A: You need to submit one representative structure in PDB format that corresponds to one of the aligned sequences. In other words the amino acid content of the protein in the PDB file should match one of the protein sequences from the input multiple sequence alignment file with at least 95% pairwise sequence identity. Therefore, you are advised to extract the sequence information from the PDB of interest and use it together with other protein sequences of a particular family when creating the final alignment. You can acquire sequence information at the (Display Files >> FASTA Sequence) or by using our command-line software PDBParser2.
Please note that the server automatically corrects your PDB in the following way. First, alternative locations of the same atom are removed and only the first occurrence of each atom will be retained. Second, Selenomethionine residues (MSE) in the HETATM field will be switched to canonical Methionines under the ATOM field. Therefore, make sure that Methionines in the PDB are also present in the sequence alignment.

Q: How should I indicate which protein sequence does the PDB correspond to?

A: Our preprocessing system will try to automatically superimpose your PDB with your multiple sequence alignment. Alternatively, you can set the "Reference" id in the "Manual" mode (see below for explanation of Zebra run modes). If your input data fails to match don`t hesitate to send us your data and we will implement your special case into our preprocessing system to ease further use of the server.

Q: How does Zebra use structural information?

A: Structural information is used by Zebra to evaluate the relationship between subfamily-specific and conserved residues in the 3D space and incorporate this data in the specificity calculations. PDB will also be used to prepare structural representation of the output results.

Q: I have structural information available for more than one protein from the alignment. How should I choose the representative PDB?

A: The user can choose any family member that has structural information available. It can be the target protein selected for the further experimental design or simply the most studied member of the group. Selecting a coordinate structure file that represents an enzyme-substrate complex will make it more convenient to study specific residues involved in ligand binding and catalytic conversion.

Q: I have heteroatoms in my PDB. Is it OK?

A: It depends on the nature of the heteroatoms.
The server automatically corrects Selenomethionine residues (MSE) in the HETATM field to canonical Methionines under the ATOM field. Therefore, make sure that Methionines in the PDB are also present in the sequence alignment. All other non-canonical amino acids that are tagged as HETATM will be dismissed and treated as a gap when matching with the corresponding sequence from the alignment.
If hetatoms are not part of the protein and instead represent ligands, cofactors, solvent molecules etc. they will be kept in the file and highlighed in the structural output files. In this respect selecting a representative structure file that represents an enzyme-substrate complex will make it more convenient to study specific residues involved in ligand binding and catalytic conversion.

Q: What are the different Zebra run modes: QuickZebra, QuickZebra+3D and Manual?

A: Zebra provides three input modes that differ by complexity and type of the input data required to start the analysis. The “QuickZebra” mode is the most straightforward and easy to use way to run the bioinformatic analysis which requires a MSA only for the input. The “QuickZebra + 3D” mode performs sequence and structural bioinformatic analysis and in addition to the sequence input requires a PDB structure file that should correspond to one of the MSA sequences. Finally, the “Manual” mode provides the ability to edit
algorithm parameters that control the automatic classification and identification of SSPs.

Q: I have only sequence information about my proteins. Which Zebra mode should I use?

A: You should align your sequences and submit a fasta format MSA in the "QuickZebra" mode.

Q: I have both the sequence information about my proteins and at least one corresponding structure. Which Zebra mode should I use?

A: You should choose the "QuickZebra+3D" mode in order to use the most advanced features of the bioinformatic analysis.

Q: I want to control automatic subfamily classification procedure or use my own subfamily classification. Which Zebra mode should I use?

A: You should choose the "Manual" mode in order to edit the
algorithm parameters that control the automatic classification. On the submission page scroll down to section "Functional subfamily classification". Here you can submit your own subfamily classification or specify clustering parameters for the automatic classification.

Q: I don`t like the automatic classification proposed by Zebra. What can I do?

A: If you are not happy with the automatically proposed classifications it could mean that functional groups in your alignment substantially vary in size (for example - one subfamily contains 90% of the proteins and another 2 subfamilies do not exceed 5% of sequences) or are too small compared to the overall size of the sample (for example, more than 20 subfamilies in the set with some groups containing less than 5% of sequences). In this case try setting the "Subfamily size limit" to small values and repeat the calculation. By default each subfamily is allowed to contain not less than 5% of the sample size. Consequently, setting the limit to, for example, 3 sequences will allow small groups for bioinformatic analysis. You should choose the "Manual" mode, scroll down to section "Functional subfamily classification", select the checkbox "Specify clustering parameters manually" and type a number corresponding the expected minimum number of sequences in a subfamily in the "Subfamily size limit" text field.

Q: I want to control the SSP prediction procedure. Which Zebra mode should I use?

A: You should choose the "Manual" mode in order to edit the algorithm parameters that control the SSP prediction.

Q: Which parameters to set in the "Manual" mode?

A: If for any reasons you don`t like the default setup provided by the "QuickZebra" and "QuickZebra+3D" modes you can set the parameters manually. See description of parameters below and consult our paper for benchmarking with different setup.
Suplatov, D., Shalaeva, D., Kirilin, E., Arzhanik, V., & Švedas, V. (2013). Bioinformatic analysis of protein families for identification of variable amino acid residues responsible for functional diversity. J.Biomol.Struct.Dyn., PMID:23384165.

Also check out the bioinformatic analysis of glutathione S-transferase superfamily for example of the input data and parameters setup.

Management of input data

  • Multiple sequence alignment of a protein family. At least 6 sequences are required for the bioinformatic analysis (at least 3 sequences per subfamily and at least two subfamilies).
  • Gap threshold - maximal gap occurrence in a column. Columns dominated by gaps usually do not contain any important information.
    Example: set to "30" to remove columns with more than 30% of gaps
    Default: 5% of gaps
  • Reference and offset . Select a sequence to be used as reference and an offset value to amino acid position in the sequence in the output file. The two parameters would not affect the calculations.
    Example: setting reference to "5" will select the 5th sequence (ex.:ADSST) from the top of the alignment file as a reference. Positions will be shown in the output file as 1A, 2D, 3S, 4S, 5T. Setting offset to "3" would change it to 4A, 5D, 6S, 7S, 8T and could be useful in case the alignment sequence is incomplete and misses first three residues. Offset could be set to a negative value.
    Default: 1st sequence is taken as reference with zero offset (positions are numbered according to the order they appear in the sequence

Prediction of subfamily-specific positions

  • Specificity scoring function. RESP function that considers residue conservation and physicochemical conservation will be used.
  • Random permutations. Reliability of statistical calculations is regulated by number of random permutations.
    Example: Setting to "1000" will perform 1000 random permutation in every column of a multiple sequence alignment
    Default: 1000 shuffles

Prediction of subfamily-specific positions: optional input

  • Upload a PDB coordinate structure file that corresponds to one of the sequences in the alignment.

    Only if the PDB coordinate structure file had been uploaded:

  • Define "Active site" area. Residues from the active site will be indicated as '*' in the output file.
    Example: "ATP 10" will select all residues within 10 angstroms from any atom of ATP molecule from the PDB file.
    Default: active site definition is off
  • Use 3D-mode: set radius to calculate neighborhood for every residue and number of random permutations to calculate conserved positions.
    Example: set radius to "4" to consider specificity and conservation of neighboring residues within 4 angstroms when calculating specificity of a residue. Set random permutation to "1000" for 1000 shuffles in every column.
    Default: 4 angstroms radius is used to calculate neighbors with 1000 random permutations to calculate conservation rate in every column

Functional subfamily classification

  • Manual subfamily definition. If pre-defined subfamily classification is available user is welcome to provide it to the program. Classification is submitted as a text file listing space separated sequence ID`s that belong to one family in one line and different subfamilies as different lines (first sequence has id of 1).
    Example: Alignment, Groupfile. Perl script can be used to learn sequence id`s from a fasta alignment and assist groupfile preparation.
  • Alternatively, user has an opportunity to process his request in the absence of external functional annotation. Zebra provides a built-in procedure that can be used to Predict functional subfamilies.
    • Subfamily size limit. Zebra can create subfamilies with at least 3 sequences (minimally reasonable value). Program was benchmarked with the value of '3' and showed competitive results. However, if you are analyzing a superfamily of thousandths of sequences you are probably not interested in looking at subfamilies of size 3. Thus, to save computational time you can adjust this parameter to the expected size of the smallest subfamily. For CPU time efficiency this parameter has been set by default to 5% of the input sample size but can be changed to any value by the user. If you are not happy with automatically proposed classifications it could mean that functional groups in your alignment substantially vary in size or are too small compared to the overall size of the sample. Try setting "Subfamily size limit" to smaller values (for example: 3 sequences) and repeat the calculation.
      Example: value of "3" will allow subfamilies with at least 3 sequences
      Default: 5% from the number of sequences in alignment but not less then 3 sequences
    • Outliers . User has an option to select a threshold for outliers (sequences not assigned to a subfamily). Classification exceeding this threshold will be removed. Setup for this parameter showed comparable performance in a range 0-30%. Thus, value 20% is set by default.
      Example: value of '0.2' will make the program to accept classification with not more than 20% outliers compared to the sample size
      Default: 0.2 (not more than 20% from the samle size)
    • Search by expected number of subfamilies. User has an option to select expected number of subfamilies. This can be set either as a range or as a particular value. Classifications exceeding this threshold will not be considered. Zebra showed significantly better performance when given expected number of subfamilies as an input.
      Example: setting "mingroups" to "2" and "maxgroups" to "2" will make the program to create only two-group classifications
      Default: number of subfamilies is not limited

Troubleshooting guide for common errors related to the "Step 4: The Zebra bioinformatic analysis"
We recommend to submit a representative protein structure together with a multiple sequence alignment. Availability of structural information can increase the accuracy of bioinformatic predictions (the 3D-mode). The user can also benefit from the structure-based annotation of the subfamily-specific positions which is a convenient tool to study the server output.

It is expected that the submitted protein structure
(1) corresponds (i.e., is 100% identical) to one of the protein sequences in the alignment, and
(2) does not contain errors (i.e., ambiguous formatting, incomplete records, etc.).

The server implements the automatic preprocessing step to help the user and prepare his data for compatibility with this server. The task will be accepted even if the highest pairwise sequence identify between the submitted protein structure and any protein in the alignment is far below 100%. In rare cases the automatic preprocessing can not be applied due to ambiguity of the user input. In these cases the "Step 4: The Zebra bioinformatic analysis" would fail and the user would have to manually correct the input prior to a new submission.

The common errors and recommended solutions are described below. If this troubleshooting guide does not solve your problem please send us an error report and we will be happy to provide assistance in your particular case.

ERROR: Internal error while calculating structure-based neighbour list (0 neighbours found)

This error usually happens when the representative protein, which was submitted as a PDB file to the server, has poor similarity (local or global) with the reference protein in the multiple sequence alignment.

The PDB file is expected to represent one protein from the sequence alignment, i.e. the two should be identical in the amino acid sequence. You should always aim at submitting a representative PDB structure that corresponds to a protein which is highly similar to the reference protein sequence in the alignment (i.e., >95%). For your convenience Zebra implements the preprocessing step to automatically select the reference protein in the multiple alignment that has the highest pairwise sequence similarity with the representative protein in the PDB. All mismatching regions between the representative protein structure in the PDB and the reference protein sequence in the alignment are automatically removed. Significant changes to the representative protein structure and the reference protein sequence could be introduced at this step if the representative protein sequence has low similarity to the reference protein sequence. The low sequence similarity can be global, i.e. the two proteins are remote homologs, or local, e.g. the sequence and structure belong to the same protein but some flexible loop is only partially resolved in the PDB. As a result of these changes the automatically updated PDB structure can contain gaps in the backbone. If a residue in the structure has no neighbours within the 3D-mode cut-off radius (4A by default) then the bionformatic analysis fails.

There are three ways to resolve this problem:

(1) Quick and easy workaround - can make it work but is likely to introduce bias and reduce the scientific value of the results, should be used for information purposes only. Open the log file and go to the "Step 3: Preprocessing of the Input Data" section. Download the PDB structure of the representative protein after preprocessing (you can find the link at the end of the section). Have a look at the annotated pairwise alignment between the "pdb" and the "best_match". Find the "lonely" residues in the "pdb" and delete them from the downloaded PDB file. E.g., >pdb ANKGGPSEGA means that only the P was preserved in the PDB file after preprocessing and all other residues were dismissed. This Proline is likely to cause the error when using the 3D-mode because it seems to be too far from other residues in the 3D space. Check that the Proline is also "lonely" in the structure by looking at the corresponding PDB file and then remove it (delete the coordinates of the corresponding atoms) from the PDB file. Submit a new task to the server;

(2) Find a better representative protein in the PDB database. BLAST your multiple alignment versus the sequences of proteins in the PDB database and select a better match than your current representative protein;

(3) Build a 3D model of the representative protein based on the available structural data. You can reconstruct the missing loops in the globule or predict the entire structure using the homology modeling. The highly capable Modeller software can do both. If you do not know how to use the Modeller to build a model of the representative protein for Zebra/pocketZebra you can contact us and we will e-mail you the template scripts.

ERROR: Neighbour list for acid with alignment position XXX contains pdb position YYY not presented in the alignment

This error means that you have an incomplete amino acid record in your PDB file, i.e. the coordinates of the 'CA' atom of a certain residue are missing. This problem with your PDB file has to be corrected manually.

Open the PDB file in the 3D structure viewer (e.g., PyMol) or in the text editor and find the YYY'th amino acid, i.e., the YYY'th amino acid record in the file (the count starts from 1). Please note that the residue ID of the YYY'th amino acid would be different from "YYY" if the residue numbering in the PBD does not start from 1 and/or some residues are missing in the backbone. Check if the record for the YYY'th amino acid is incomplete (if not, you might have lost your count), e.g.:

ATOM    675  N   ALA A  84      30.430   1.654  28.087  1.00 11.02           N  
ATOM    676  CA  ALA A  84      31.779   1.682  27.569  1.00 13.03           C  
ATOM    677  C   ALA A  84      32.111   2.986  26.814  1.00 14.26           C  
ATOM    678  O   ALA A  84      33.266   3.161  26.456  1.00 13.99           O  
ATOM    679  CB  ALA A  84      32.036   0.525  26.603  1.00 16.45           C  
ATOM    680  N   LEU A  85      31.167   3.874  26.582  1.00 14.45           N  
ATOM    688  N   SER A  86      33.067   6.758  25.759  1.00 15.08           N  
ATOM    689  CA  SER A  86      33.859   7.861  26.275  1.00 16.99           C  
ATOM    690  C   SER A  86      32.871   8.932  26.675  1.00 18.70           C  
ATOM    691  O   SER A  86      31.712   9.057  26.205  1.00 16.46           O  
ATOM    692  CB  SER A  86      34.744   8.419  25.138  1.00 17.40           C  
ATOM    693  OG  SER A  86      33.883   9.078  24.215  1.00 17.15           O  

In the example above the record of the LEU85 is incomplete as the coordinates for most atoms (including the 'CA' atom) are missing.

Once you find the residue in question you have the following possible actions:
(1) remove (delete) the entire residue from the PDB file and submit a new task;
(2) append the coordinates of the missing atoms (e.g., with the help of Modeller molecular modeling package) to the PDB file and submit a new task. In this case you may need to re-build the entire multiple alignment to include this residue.