Explanation of Zebra Output


Zebra results are provided in two forms: a single all-in-one parsable text file, and PyMol session files with a structural representation of the subfamily-specific positions (SSPs).

Hint: See the bioinformatic analysis of the glutathione S-transferase superfamily for an example of the output data.


Single all-in-one parsable text file
The Zebra output text file contains:
  • List of subfamily classifications ranked by decreasing significance (if automatic classification was used)
  • For every subfamily classification (there can be more than one if automatic classification was used):
    • List of predicted conserved positions (if Zebra was run in 3D-mode), including results of statistical evaluation and amino acid content of the corresponding columns
    • List of predicted subfamily-specific positions, including results of statistical evaluation and amino acid content of the corresponding columns
The Zebra output file can be used on its own, without cross-referencing the original MSA. It is a plain text file in which every line begins with a context-specific header, so it is easy to parse with command-line tools and simple editors and can be integrated into automatic pipelines. A detailed explanation is provided below.
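As an illustration of how the per-line headers help with parsing, the sketch below groups lines by their leading token or tokens. The sample text and its column layout are invented for this example and may not match a real Zebra file, so check the header structure of your own output before relying on it.

```python
from collections import defaultdict


def group_lines_by_header(text, header_tokens=1):
    """Group non-empty lines by their leading whitespace-separated token(s).

    header_tokens is an assumption of this sketch: the number of leading
    tokens that together form the header of a line.
    """
    groups = defaultdict(list)
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        key = " ".join(tokens[:header_tokens])
        groups[key].append(tokens[header_tokens:])
    return dict(groups)


# Hypothetical sample; a real Zebra file may be laid out differently.
SAMPLE = """\
RUNC1 GRPS 1
RUNC1 SEQN seqA seqB
RUNC1 DATA 1 57 0.91
RUNC1 DATA 2 112 0.85
RUNC2 DATA 1 57 0.88
"""

by_run = group_lines_by_header(SAMPLE)
by_run_and_field = group_lines_by_header(SAMPLE, header_tokens=2)
print(sorted(by_run))
print(by_run_and_field["RUNC1 DATA"])
```

Grouping on one token collects everything that belongs to a subfamily classification; grouping on two tokens separates the individual fields within it.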



The STAT1 field at the end of the output file ranks the subfamily classifications by the significance of the SSPs they produce. Each subfamily classification is named RUNC[X], where RUNC stands for "RUN Combination" and [X] is the ID of the subfamily classification. The user can select any of the classifications for further study.
The ID of the best subfamily classification is also stated under the RESULTS1 field.


RUNC. Information about the subfamily-specific positions for a given subfamily definition is provided under the "RUNC[X]" tag, where [X] refers to the order in which the subfamily definitions were analyzed.


GRPS, SEQN. Every RUNC contains the definition of the subfamilies used to calculate SSPs, that is, the way the initial set of enzymes was separated into groups. The GRPS field defines the subfamily (group) identifier, while the SEQN field lists the sequences belonging to that subfamily.


The STAT field contains the results of statistical evaluation: the cut-off P-values that split positions into sets by statistical significance.
The DATA field contains explicit information about every position of the alignment, ranked by decreasing specificity:
  • Rank - index of a position, smaller ranks indicate higher specificity
  • Pos - index of a column in the multiple alignment
  • Raw-score - specificity score according to the original specificity scoring function, as described in the Algorithm section
  • Z-score - specificity score corresponding to the standard normal distribution
  • P-value - statistical significance of the observed set of specific positions. The P-value for a position with rank k denotes the probability of obtaining by chance a set of positions with ranks from 1 to k
  • Reference - corresponding residue in the reference sequence; positions in the active center area (if defined) are marked with "*"
  • SEQref and PDBref - if a PDB structure file has been submitted, the subfamily-specific columns are referenced to both the alignment sequence and the PDB structure. These references might not be identical, which indicates missing residues in the alignment sequence and/or the PDB structure. This is not an error, since database sequences and structures frequently contain missing data. The dual references provided by Zebra therefore give the user more flexibility
  • Sequence.by.group - full content of the multiple alignment column; the amino acid order corresponds to the order of sequences listed in the SEQN field, and subfamilies are separated with dots.
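The Sequence.by.group value can be split back into per-subfamily residue strings with a few lines of code. The following is a minimal sketch: the column string and the group identifiers are invented for illustration, and it assumes that dots are used only as subfamily separators (not as gap characters) in your file.

```python
def split_column_by_group(column, group_ids):
    """Split a dot-separated alignment column into per-subfamily strings.

    column    : a Sequence.by.group value, e.g. "SSSS.AAG.TT"
    group_ids : subfamily identifiers in the same order as in the file
    """
    parts = column.split(".")
    if len(parts) != len(group_ids):
        raise ValueError(
            "number of dot-separated blocks does not match number of groups"
        )
    return dict(zip(group_ids, parts))


# Hypothetical column: three subfamilies with 4, 3 and 2 sequences.
column = "SSSS.AAG.TT"
per_group = split_column_by_group(column, ["G1", "G2", "G3"])
print(per_group)
```

Within each block, the i-th character belongs to the i-th sequence listed for that subfamily in the SEQN field.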
Conserved positions are calculated as part of 3D-mode for every subfamily definition, as they might differ in outliers. The output for conserved positions precedes the output for SSPs in the output file. For example, to look at the conserved positions calculated during RUNC2, search for RUNC2 and scroll up. The structure of the output for conserved positions is similar to that for SSPs but does not contain the "RUNC[X]" tag at the beginning of each line.



PyMol session files

If a structure has been provided to the server (in either QuickZebra+3D or Manual mode) it is used to automatically produce PyMol session files (.pse) with structural representations of subfamily-specific and conserved positions for each subfamily classification. These files can be downloaded and used straightaway for structural analysis with PyMol without any additional programming experience.

The PyMol sessions have a four-layer structure, with each layer corresponding to a different kind of data. These layers can be switched on and off to create the visual representation of the bioinformatic analysis results that is most helpful for your particular task. A snapshot of the default representation of results is shown to the right: first is the layer of subfamily-specific positions, where the red-to-cyan gradient corresponds to the significance of the observed hits; second is the layer of the most significant conserved positions, visible as sticks (yellow); finally, the heteroatoms are shown in green.




Layer 1: SUBFAMILY-SPECIFIC

This layer contains information about the subfamily-specific positions identified in your family. The C-alpha atoms of significant SSPs are gradient-painted according to the calculated specificity Z-scores: red stands for highly significant hits, cyan for non-informative and conserved positions.

The subfamily-specific positions are organized into sets based on statistical significance (local P-value thresholds) and shown as "SSP_Set_X_P-value" on the right panel of the PyMol viewer, where X stands for the set ID and P-value is the assigned significance (see the STAT field). The P-value is the probability that all positions in a set would be observed by chance. All hits above the global P-value threshold are shown. Sets are ranked by decreasing significance.

Sets are cumulative. This means that Set4 includes all positions from Sets 1 to 3 and adds new ones. Positions from the first set (the first best hits) are shown as sticks. Click on the "S" menu for a particular set to select a representation method (for example: show as sticks).
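The cumulative behaviour of the sets can be sketched as follows. This is only an illustration: it assumes that a set consists of all positions whose P-value does not exceed the set's cut-off, and the positions, P-values and cut-offs are invented. How Zebra actually chooses the thresholds is described in the algorithm paper, not here.

```python
def cumulative_sets(positions, cutoffs):
    """Build cumulative sets of positions from P-value cut-offs.

    positions : list of (alignment_position, p_value) tuples
    cutoffs   : P-value cut-offs; each set holds every position whose
                P-value is <= its cut-off (an assumption of this sketch)
    Returns a list of (set_id, cutoff, members), most significant first.
    """
    result = []
    for set_id, cutoff in enumerate(sorted(cutoffs), start=1):
        members = [pos for pos, p in positions if p <= cutoff]
        result.append((set_id, cutoff, members))
    return result


# Hypothetical ranked positions and cut-offs.
positions = [(57, 1e-9), (112, 1e-6), (23, 1e-4), (201, 0.02)]
cutoffs = [1e-8, 1e-5, 1e-3]

for set_id, cutoff, members in cumulative_sets(positions, cutoffs):
    print(set_id, cutoff, members)
```

Each set contains all members of the previous one plus the positions that pass only the looser cut-off; position 201 passes none of the cut-offs and therefore appears in no set.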



Layer 2: CONSERVED

This layer contains information about the significant conserved positions identified in your family. The carbon atoms of conserved positions are gradient-painted according to the calculated conservation Z-scores: yellow stands for highly significant hits, grey for non-conserved positions.

The conserved positions are organized into sets based on statistical significance (local P-value thresholds) and shown as "CON_Set_X_P-value" on the right panel of the PyMol viewer, where X stands for the set ID and P-value is the assigned significance (see the STAT field). The P-value is the probability that all positions in a set would be observed by chance. All hits above the global P-value threshold are shown. Sets are ranked by decreasing significance.

Sets are cumulative. This means that Set4 includes all positions from Sets 1 to 3 and adds new ones. Positions from the first set (the first best hits) are shown as sticks. Click on the "S" menu for a particular set to select a representation method (for example: show as sticks).



Layer 3: SKIPPED

This layer contains information about positions that were not considered for SSP prediction. These are either invariant (completely conserved) or overpopulated by gaps in the multiple alignment you submitted for analysis:
  • A position is colored white if it is either invariant (completely conserved) or overpopulated by gaps, that is, it contains more gaps than allowed by the threshold (by default, no more than 5% of gaps in a column). Such positions are skipped during the bioinformatic analysis and the SSP prediction. The skipped positions are also shown as sticks so that you can see them under the first two layers (SUBFAMILY_SPECIFIC and CONSERVED). You can change the representation of the skipped positions using the "skipped_sel" selection.
  • A position is colored black if its gap population is within the allowed limit and it is not invariant (completely conserved). Such positions are common to the entire family and are considered by the bioinformatic analysis.
If a position (or a set of positions) contains a high proportion of gaps, the corresponding region of the protein structure may be absent or may lack a common organization in some members of your family. This layer therefore helps you see which parts of the structure differ significantly among the proteins within your family.
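The white/black rule described above can be expressed as a small check on a single alignment column. This is a sketch rather than Zebra's own code: it assumes that gaps are written as "-" and that "invariant" means a single distinct residue among the non-gap characters. The function name and the example columns are hypothetical.

```python
def layer3_color(column, max_gap_fraction=0.05):
    """Return "white" for a skipped column and "black" for a considered one.

    A column is skipped if it has more gaps than the threshold allows or if
    it is invariant. Assumptions of this sketch: "-" marks a gap, and
    invariance is judged on the non-gap residues only.
    """
    if not column:
        raise ValueError("empty alignment column")
    column = column.upper()
    gap_fraction = column.count("-") / len(column)
    if gap_fraction > max_gap_fraction:
        return "white"  # overpopulated by gaps
    residues = set(column) - {"-"}
    if len(residues) <= 1:
        return "white"  # invariant (completely conserved)
    return "black"


# Hypothetical columns from a 40-sequence alignment.
print(layer3_color("A" * 40))                    # invariant
print(layer3_color("A" * 20 + "G" * 19 + "-"))   # 2.5% gaps, variable
print(layer3_color("A" * 20 + "G" * 10 + "-" * 10))  # 25% gaps
```

Raising max_gap_fraction corresponds to relaxing the gap threshold in the Manual mode, with the caveats discussed below.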

In the example shown to the right, most of the protein is colored black, meaning that its structure is common to the entire family. The white-colored loops in the "north-west" of the molecule indicate regions whose structure varies within the family. These white regions were not properly aligned in your MSA and are therefore not considered in the bioinformatic analysis.

If the white-colored positions (overpopulated by gaps and thus skipped during the bioinformatic analysis) appear in protein regions that are important for your work you can:
  • Realign your proteins with a particular focus on the regions of interest. For example, if a particular loop is present in all homologous proteins and has the same length but a different shape and orientation, it can be misaligned by structure superimposition software. In this case you may have to correct your alignment manually.
  • Exclude from the family set those proteins that do not contain the region of interest. Keep in mind, however, that not all members of your family have the corresponding region.
  • Relax the gap threshold in the Manual mode. However, allowing columns with too many gaps can produce meaningless results and jeopardize the automatic classification algorithm. See the algorithm paper for more on the gap threshold. You are therefore advised to keep the gap threshold reasonably low (by default, no more than 5% of gaps in a column) and to choose the right alignment for the particular task.



Layer 4: HETATOMS

Heteroatoms (substrates, inhibitors, ions and water molecules) are shown as sticks; carbons are colored chartreuse (a color halfway between yellow and green).