Guide to Zebra input

Q: How does Zebra work?

A: Zebra takes a multiple sequence alignment and, optionally, structural information about the protein superfamily as input. The algorithm does not require pre-defined subfamilies and can propose multiple classifications automatically by graph-based clustering at different fragmentation levels. Random shuffling and Bernoulli statistics are applied to rank hits by decreasing significance and to select the most significant subfamily-specific positions (SSPs) for further evaluation. Zebra results are provided in two forms: a single all-in-one parsable text file and PyMOL sessions with a structural representation of the SSPs.

Q: I want to understand how the server works.

A: Please consult our paper. We discuss the potential of Zebra to be used for understanding the structure–function relationship in proteins and provide practical guidelines to perform the bioinformatic analysis.
Suplatov D., Kirilin E., Takhaveev V., Švedas V. (2013). Zebra: web-server for bioinformatic analysis of diverse protein families, J.Biomol.Struct.Dyn., dx.doi.org PMID:24028489.

Q: What input data is required to run Zebra?

A: A multiple sequence alignment (MSA) and, optionally, structural information about the protein superfamily are used as input. In other words, a multiple sequence alignment is the only mandatory requirement for the analysis. See the description of the Zebra run modes below for more details.

Q: What are the requirements for a multiple sequence alignment?

A: The multiple sequence alignment should represent a superimposition of protein amino acid sequences and should be in FASTA format. It should contain at least six protein sequences: at least two groups (subfamilies) are required to assign subfamily-specific positions, and each group should contain at least three sequences to attain a reasonable level of statistical significance. The "-" sign is treated as a gap. The standard set of 20 canonical amino acids is allowed. The residue types "B", "Z" and "X" are automatically substituted with "D", "E" and "G", respectively, with a warning message. Other characters are not allowed.
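
The rules above can be checked locally before submission. The following Python sketch is only an illustration: the function names `read_fasta` and `check_alignment` are hypothetical, and the requirement that all aligned rows have equal length is an assumption based on what an alignment is, not a documented server check.

```python
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")        # the 20 canonical amino acids
SUBSTITUTIONS = {"B": "D", "Z": "E", "X": "G"}  # replacements named in the answer above


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_alignment(records, min_sequences=6):
    """Return (cleaned_records, warnings) or raise ValueError on a rule violation."""
    if len(records) < min_sequences:
        raise ValueError(f"at least {min_sequences} sequences are required")
    length = len(records[0][1])
    cleaned, warnings = [], []
    for header, seq in records:
        if len(seq) != length:  # assumption: aligned rows share one length
            raise ValueError(f"{header}: row length differs from the first row")
        out = []
        for column, ch in enumerate(seq, start=1):
            if ch in SUBSTITUTIONS:
                warnings.append(f"{header}: {ch} at column {column} replaced by {SUBSTITUTIONS[ch]}")
                ch = SUBSTITUTIONS[ch]
            elif ch != "-" and ch not in CANONICAL:
                raise ValueError(f"{header}: character {ch!r} at column {column} is not allowed")
            out.append(ch)
        cleaned.append((header, "".join(out)))
    return cleaned, warnings
```

Calling `check_alignment(read_fasta(open("family.fasta").read()))` on a hypothetical file `family.fasta` returns the substituted alignment together with the list of warnings.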

Q: Do I need to supply a classification of sequences into functional subfamilies?

A: The algorithm does not require pre-defined subfamilies and can propose multiple classifications automatically by graph-based clustering at different fragmentation levels. The proposed functional subfamily classifications are used to identify the subfamily-specific positions (SSPs) and are finally ranked by the significance of the SSPs they produce. Alternatively, an experimentally derived functional annotation can be provided by the user in the "Manual" mode (see below for an explanation of the Zebra run modes).

Q: Do I need structural information to run Zebra?

A: No, structural information is not mandatory and the analysis can be carried out without any structural data. However, to use the most advanced features of the bioinformatic analysis, a PDB file corresponding to one of the sequences from the MSA has to be submitted to the server. Incorporating the 3D information can significantly improve Zebra predictions.

Q: How many structures should I submit to the server?

A: You need to submit one representative structure in PDB format that corresponds to one of the aligned sequences. In other words, the amino acid sequence of the protein in the PDB file should match one of the protein sequences from the input multiple sequence alignment file with at least 95% pairwise sequence identity. You are therefore advised to extract the sequence from the PDB file of interest and use it together with the other protein sequences of the family when creating the final alignment. You can obtain the sequence at pdb.org (Display Files >> FASTA Sequence) or by using our command-line software PDBParser.
Please note that the server automatically corrects your PDB file in the following way. First, alternative locations of the same atom are removed and only the first occurrence of each atom is retained. Second, selenomethionine residues (MSE) in HETATM records are switched to canonical methionines in ATOM records. Therefore, make sure that the methionines in the PDB file are also present in the sequence alignment.
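
To preview what these two corrections do to a file, the Python sketch below applies them to PDB text lines using the standard fixed-column layout of ATOM/HETATM records. It is an approximation under stated assumptions, not the server's code: the function name `correct_pdb_lines` is hypothetical, an atom is identified by chain, residue number with insertion code, and atom name, and the selenium atom of MSE is left unrenamed because the text above does not say how it is handled.

```python
def correct_pdb_lines(lines):
    """Keep the first occurrence of each atom and turn MSE HETATM records into MET ATOM records."""
    seen = set()
    corrected = []
    for line in lines:
        if line.startswith("MODEL"):
            seen.clear()  # assumption: atoms are tracked separately in each model
        if not line.startswith(("ATOM  ", "HETATM")):
            corrected.append(line)
            continue
        # Columns 18-20 (0-based slice 17:20) hold the residue name.
        if line.startswith("HETATM") and line[17:20] == "MSE":
            # Only the record type and residue name change; the atom name is kept as-is.
            line = "ATOM  " + line[6:17] + "MET" + line[20:]
        # chain ID, residue number + insertion code, atom name
        key = (line[21], line[22:27], line[12:16])
        if key in seen:
            continue  # a later alternative location of an atom already kept
        seen.add(key)
        corrected.append(line)
    return corrected
```

For example, `correct_pdb_lines(open("structure.pdb").read().splitlines())` (with a hypothetical file name) returns the lines that would remain after both corrections.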

Q: How should I indicate which protein sequence the PDB corresponds to?

A: Our preprocessing system will try to automatically superimpose your PDB with your multiple sequence alignment. Alternatively, you can set the "Reference" id in the "Manual" mode (see below for an explanation of the Zebra run modes). If your input data fails to match, don't hesitate to send us your data and we will add your special case to our preprocessing system to ease further use of the server.

Q: How does Zebra use structural information?

A: Zebra uses structural information to evaluate the relationship between subfamily-specific and conserved residues in 3D space and incorporates this data into the specificity calculations. The PDB file is also used to prepare the structural representation of the output results.

Q: I have structural information available for more than one protein from the alignment. How should I choose the representative PDB?

A: The user can choose any family member that has structural information available. It can be the target protein selected for the further experimental design or simply the most studied member of the group. Selecting a coordinate structure file that represents an enzyme-substrate complex will make it more convenient to study specific residues involved in ligand binding and catalytic conversion.

Q: I have heteroatoms in my PDB. Is it OK?

A: It depends on the nature of the heteroatoms.
The server automatically corrects Selenomethionine residues (MSE) in the HETATM field to canonical Methionines under the ATOM field. Therefore, make sure that Methionines in the PDB are also present in the sequence alignment. All other non-canonical amino acids that are tagged as HETATM will be dismissed and treated as a gap when matching with the corresponding sequence from the alignment.
If the heteroatoms are not part of the protein and instead represent ligands, cofactors, solvent molecules, etc., they will be kept in the file and highlighted in the structural output files. In this respect, selecting a representative structure file that represents an enzyme-substrate complex will make it more convenient to study specific residues involved in ligand binding and catalytic conversion.

Q: What are the different Zebra run modes: QuickZebra, QuickZebra+3D and Manual?

A: Zebra provides three input modes that differ in complexity and in the type of input data required to start the analysis. The "QuickZebra" mode is the most straightforward and easy-to-use way to run the bioinformatic analysis and requires only an MSA as input. The "QuickZebra+3D" mode performs sequence and structural bioinformatic analysis and, in addition to the sequence input, requires a PDB structure file that corresponds to one of the MSA sequences. Finally, the "Manual" mode provides the ability to edit the algorithm parameters that control the automatic classification and the identification of SSPs.

Q: I have only sequence information about my proteins. Which Zebra mode should I use?

A: You should align your sequences and submit the MSA in FASTA format in the "QuickZebra" mode.

Q: I have both the sequence information about my proteins and at least one corresponding structure. Which Zebra mode should I use?

A: You should choose the "QuickZebra+3D" mode in order to use the most advanced features of the bioinformatic analysis.

Q: I want to control automatic subfamily classification procedure or use my own subfamily classification. Which Zebra mode should I use?

A: You should choose the "Manual" mode in order to edit the algorithm parameters that control the automatic classification. On the submission page, scroll down to the section "Functional subfamily classification". Here you can submit your own subfamily classification or specify clustering parameters for the automatic classification.

Q: I don't like the automatic classification proposed by Zebra. What can I do?

A: If you are not happy with the automatically proposed classifications, it could mean that the functional groups in your alignment vary substantially in size (for example, one subfamily contains 90% of the proteins and two other subfamilies do not exceed 5% of the sequences) or are too small compared to the overall size of the sample (for example, more than 20 subfamilies in the set, with some groups containing less than 5% of the sequences). In this case, try setting the "Subfamily size limit" to a smaller value and repeat the calculation. By default, each subfamily must contain at least 5% of the sample size. Consequently, setting the limit to, for example, 3 sequences will allow small groups in the bioinformatic analysis. To do this, choose the "Manual" mode, scroll down to the section "Functional subfamily classification", select the checkbox "Specify clustering parameters manually" and type the expected minimum number of sequences in a subfamily into the "Subfamily size limit" text field.

Q: I want to control the SSP prediction procedure. Which Zebra mode should I use?

A: You should choose the "Manual" mode in order to edit the algorithm parameters that control the SSP prediction.

Q: Which parameters to set in the "Manual" mode?

A: If for any reason you don't like the default setup provided by the "QuickZebra" and "QuickZebra+3D" modes, you can set the parameters manually. See the description of the parameters below and consult our paper for benchmarking with different setups.
Suplatov, D., Shalaeva, D., Kirilin, E., Arzhanik, V., & Švedas, V. (2013). Bioinformatic analysis of protein families for identification of variable amino acid residues responsible for functional diversity. J.Biomol.Struct.Dyn., dx.doi.org PMID:23384165.

Also check out the bioinformatic analysis of the glutathione S-transferase superfamily for an example of the input data and parameter setup.

Management of input data

  • Multiple sequence alignment of a protein family. At least 6 sequences are required for the bioinformatic analysis (at least 3 sequences per subfamily and at least two subfamilies).
  • Gap threshold - maximal gap occurrence in a column. Columns dominated by gaps usually do not contain any important information.
    Example: set to "30" to remove columns with more than 30% of gaps
    Default: 5% of gaps
  • Reference and offset. Select a sequence to be used as the reference and an offset value to be added to the amino acid positions of that sequence in the output file. These two parameters do not affect the calculations.
    Example: setting reference to "5" will select the 5th sequence (e.g. ADSST) from the top of the alignment file as the reference. Positions will be shown in the output file as 1A, 2D, 3S, 4S, 5T. Setting offset to "3" would change this to 4A, 5D, 6S, 7S, 8T and could be useful if the alignment sequence is incomplete and lacks the first three residues. The offset can be set to a negative value.
    Default: 1st sequence is taken as reference with zero offset (positions are numbered according to the order they appear in the sequence)
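
The gap threshold and the reference/offset numbering described in this list can be illustrated with a small Python sketch. The function names are hypothetical, and the treatment of gaps in the reference sequence (they receive no label) is an assumption; the server's own implementation may differ.

```python
def columns_below_gap_threshold(sequences, gap_threshold_percent):
    """Return indices of alignment columns whose gap share does not exceed the threshold."""
    n = len(sequences)
    kept = []
    for i in range(len(sequences[0])):
        gap_percent = 100.0 * sum(seq[i] == "-" for seq in sequences) / n
        if gap_percent <= gap_threshold_percent:
            kept.append(i)
    return kept


def reference_labels(reference_sequence, offset=0):
    """Label each alignment column with the residue number and letter of the reference."""
    labels = []
    number = offset
    for residue in reference_sequence:
        if residue == "-":
            labels.append(None)  # assumption: gap columns carry no reference label
        else:
            number += 1
            labels.append(f"{number}{residue}")
    return labels


print(reference_labels("ADSST"))            # ['1A', '2D', '3S', '4S', '5T']
print(reference_labels("ADSST", offset=3))  # ['4A', '5D', '6S', '7S', '8T']
```

The two printed lists reproduce the numbering from the example in the list above, without and with an offset of 3.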

Prediction of subfamily-specific positions

  • Specificity scoring function. The RESP function, which considers residue conservation and physicochemical conservation, will be used.
  • Random permutations. The reliability of the statistical calculations is regulated by the number of random permutations.
    Example: setting to "1000" will perform 1000 random permutations in every column of the multiple sequence alignment
    Default: 1000 shuffles
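
The role of the number of random permutations can be illustrated with a generic column-shuffling test in Python. This is only a sketch of the general technique: the score `group_agreement` is a simple stand-in and is not the RESP function, the empirical p-value shown here is not necessarily the statistic Zebra reports, and all names are hypothetical. Sequence indices in `groups` are 0-based in this sketch.

```python
import random
from collections import Counter


def group_agreement(column, groups):
    """Stand-in score (NOT RESP): number of sequences that carry the most
    common residue of their own subfamily."""
    return sum(Counter(column[i] for i in group).most_common(1)[0][1]
               for group in groups)


def permutation_p_value(column, groups, n_permutations=1000, seed=0):
    """Fraction of shuffled columns that score at least as high as the observed one."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    observed = group_agreement(column, groups)
    residues = list(column)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(residues)  # break the link between residues and subfamilies
        if group_agreement(residues, groups) >= observed:
            hits += 1
    return hits / n_permutations


groups = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(permutation_p_value("AAAAACCCCC", groups))  # small: the column separates the groups
print(permutation_p_value("AAAAAAAAAA", groups))  # 1.0: a fully conserved column is not specific
```

More permutations give a finer-grained estimate (with 1000 shuffles the smallest non-zero value is 0.001) at the cost of proportionally more computation per column.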

Prediction of subfamily-specific positions: optional input

  • Upload a PDB coordinate structure file that corresponds to one of the sequences in the alignment.

    Only if the PDB coordinate structure file has been uploaded:

  • Define "Active site" area. Residues from the active site will be indicated as '*' in the output file.
    Example: "ATP 10" will select all residues within 10 angstroms of any atom of the ATP molecule in the PDB file.
    Default: active site definition is off
  • Use 3D-mode: set the radius used to calculate the neighborhood of every residue and the number of random permutations used to calculate conserved positions.
    Example: set radius to "4" to consider the specificity and conservation of neighboring residues within 4 angstroms when calculating the specificity of a residue. Set random permutations to "1000" for 1000 shuffles in every column.
    Default: a radius of 4 angstroms is used to calculate neighbors, with 1000 random permutations to calculate the conservation rate in every column
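
An active site definition such as "ATP 10" amounts to a distance query on the PDB coordinates. The Python sketch below shows one way to reproduce such a selection locally, assuming the standard fixed-column PDB layout; the function names are hypothetical, and the choice to report only residues from ATOM records is an assumption.

```python
import math


def parse_atoms(pdb_lines):
    """Extract record type, residue identity and coordinates from ATOM/HETATM lines."""
    atoms = []
    for line in pdb_lines:
        if line.startswith(("ATOM  ", "HETATM")):
            atoms.append({
                "hetero": line.startswith("HETATM"),
                "res_name": line[17:20].strip(),
                "chain": line[21],
                "res_num": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms


def residues_near_ligand(pdb_lines, ligand_name, radius):
    """Residues with at least one atom within `radius` angstroms of any ligand atom."""
    atoms = parse_atoms(pdb_lines)
    ligand_xyz = [a["xyz"] for a in atoms if a["hetero"] and a["res_name"] == ligand_name]
    near = set()
    for atom in atoms:
        if atom["hetero"]:
            continue  # assumption: only protein residues (ATOM records) are reported
        if any(math.dist(atom["xyz"], xyz) <= radius for xyz in ligand_xyz):
            near.add((atom["chain"], atom["res_num"], atom["res_name"]))
    return sorted(near)
```

With the lines of a PDB file in `pdb_lines`, `residues_near_ligand(pdb_lines, "ATP", 10)` lists the residues that the "ATP 10" example would be expected to mark. The same kind of distance test, applied between residues with a 4 angstrom radius, corresponds to the neighborhood used in the 3D-mode.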

Functional subfamily classification

  • Manual subfamily definition. If a pre-defined subfamily classification is available, the user is welcome to provide it to the program. The classification is submitted as a text file in which each line lists the space-separated sequence IDs that belong to one subfamily, with different subfamilies on different lines (the first sequence has an ID of 1).
    Example: Alignment, Groupfile. A Perl script can be used to look up sequence IDs in a FASTA alignment and assist groupfile preparation.
  • Alternatively, the user can run the analysis without external functional annotation. Zebra provides a built-in procedure that can be used to predict functional subfamilies.
    • Subfamily size limit. Zebra can create subfamilies with at least 3 sequences (the minimal reasonable value). The program was benchmarked with the value of '3' and showed competitive results. However, if you are analyzing a superfamily of thousands of sequences, you are probably not interested in looking at subfamilies of size 3. Thus, to save computational time you can adjust this parameter to the expected size of the smallest subfamily. For CPU time efficiency this parameter is set by default to 5% of the input sample size but can be changed to any value by the user. If you are not happy with the automatically proposed classifications, it could mean that the functional groups in your alignment vary substantially in size or are too small compared to the overall size of the sample. Try setting "Subfamily size limit" to a smaller value (for example, 3 sequences) and repeat the calculation.
      Example: value of "3" will allow subfamilies with at least 3 sequences
      Default: 5% of the number of sequences in the alignment but not less than 3 sequences
    • Outliers. The user can select a threshold for outliers (sequences not assigned to a subfamily). Classifications exceeding this threshold will be removed. The program showed comparable performance with this parameter in the range 0-30%; the value of 20% is therefore set by default.
      Example: value of '0.2' will make the program accept classifications with not more than 20% outliers compared to the sample size
      Default: 0.2 (not more than 20% of the sample size)
    • Search by expected number of subfamilies. The user can specify the expected number of subfamilies, either as a range or as a particular value. Classifications outside this range will not be considered. Zebra showed significantly better performance when given the expected number of subfamilies as input.
      Example: setting "mingroups" to "2" and "maxgroups" to "2" will make the program create only two-group classifications
      Default: number of subfamilies is not limited
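
The groupfile format and the size and outlier thresholds above can be sketched in Python as follows. The function names are hypothetical, rounding 5% up to a whole number of sequences is an assumption, and applying the thresholds to a user-supplied classification is done here only to illustrate how they work; the text above describes them as parameters of the automatic prediction.

```python
def parse_groupfile(text):
    """One subfamily per line; space-separated sequence IDs, first sequence has ID 1."""
    return [[int(token) for token in line.split()]
            for line in text.splitlines() if line.strip()]


def default_size_limit(n_sequences):
    """5% of the sample size but not less than 3 (rounding up is an assumption)."""
    return max(3, (5 * n_sequences + 99) // 100)


def check_classification(groups, n_sequences, size_limit=None, max_outliers=0.2):
    """Return a list of human-readable problems; an empty list means none were found."""
    if size_limit is None:
        size_limit = default_size_limit(n_sequences)
    assigned = [seq_id for group in groups for seq_id in group]
    if len(set(assigned)) != len(assigned):
        raise ValueError("a sequence ID is listed more than once")
    if any(seq_id < 1 or seq_id > n_sequences for seq_id in assigned):
        raise ValueError("sequence IDs must lie between 1 and the number of sequences")
    problems = []
    if len(groups) < 2:
        problems.append("at least two subfamilies are required")
    for number, group in enumerate(groups, start=1):
        if len(group) < size_limit:
            problems.append(f"subfamily {number} has {len(group)} sequences, limit is {size_limit}")
    outliers = n_sequences - len(assigned)  # sequences not assigned to any subfamily
    if outliers > max_outliers * n_sequences:
        problems.append(f"{outliers} outliers exceed {max_outliers:.0%} of the sample")
    return problems


groups = parse_groupfile("1 2 3\n4 5 6\n")
print(groups)                           # [[1, 2, 3], [4, 5, 6]]
print(check_classification(groups, 6))  # []
```

Here the two-line groupfile assigns sequences 1-3 to the first subfamily and 4-6 to the second, which satisfies the minimum of two subfamilies with three sequences each.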