Troubleshoot guide | biokinet.belozersky.msu.ru

Zebra troubleshooting guide

This page describes common problems that can occur during "Step 4: The Zebra bioinformatic analysis" of the Zebra pipeline and provides workarounds for them. If this troubleshooting guide does not solve your problem, please send us an error report, and we will be happy to provide assistance in your particular case.


We recommend submitting a representative protein structure together with the multiple sequence alignment. Structural information can increase the accuracy of the bioinformatic predictions (the 3D-mode). The user can also benefit from the structure-based annotation of the subfamily-specific positions, which is a convenient tool for studying the server output.

It is expected that the submitted multiple alignment

(1) contains an alignment of sequences of homologous (evolutionarily related) proteins,

(2) is in the FASTA format,

(3) does not contain a large number of columns with a high content of gaps. By default, all columns with a gap frequency above 5% are dismissed,

(4) is sufficiently large. Zebra/Zebra2 were developed to process large alignments of hundreds to thousands of proteins. While the web server will accept any alignment with at least six sequences, it may not work on very small alignments due to limitations of the statistical models.

It is expected that the submitted protein structure

(1) is in the PDB format (i.e., not just the '.pdb' file extension, but the way the actual plain text data is laid out inside the file), in particular

(2) all standard amino acid residues should be provided in the "ATOM" field, and all ligands, ions, cofactors, inhibitors and non-standard amino acid residues in the "HETATM" field;

(3) each item in the PDB (an amino acid or a ligand) should have a unique identifier (chain ID + residue name + residue ID);

(4) the protein in the PDB should correspond (i.e., be 100% identical) to one of the protein sequences in the alignment;

(5) in general, the PDB should not contain errors (i.e., ambiguous formatting, incomplete records, etc.).
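Expectation (4) can be checked before submission by reading the amino acid sequence directly from the structure file. The following is a minimal sketch, not part of Zebra: the function name `pdb_sequence` is hypothetical, it assumes a file that follows the fixed-column PDB layout, and it reads only the 'CA' atoms of "ATOM" records of one chain (chain A unless another chain ID is given). Residues that are not among the 20 standard amino acids are printed as X.

```shell
# pdb_sequence FILE [CHAIN] : one-letter sequence taken from the CA atoms of ATOM records
pdb_sequence() {
    awk -v chain="${2:-A}" '
        BEGIN {
            # three-letter to one-letter codes of the 20 standard amino acids
            split("ALA A ARG R ASN N ASP D CYS C GLN Q GLU E GLY G HIS H ILE I LEU L LYS K MET M PHE F PRO P SER S THR T TRP W TYR Y VAL V", t, " ")
            for (i = 1; i <= 40; i += 2) code[t[i]] = t[i + 1]
        }
        /^ENDMDL/ { exit }                          # use the first model only
        substr($0, 1, 6) == "ATOM  " && substr($0, 13, 4) == " CA " && substr($0, 22, 1) == chain {
            id = substr($0, 23, 5)                  # residue number plus insertion code
            if (id == last) next                    # alternate location of the same residue
            last = id
            res = substr($0, 18, 3)
            aa = (res in code) ? code[res] : "X"
            seq = seq aa
        }
        END { print seq }
    ' "$1"
}
```

For example, `pdb_sequence structure.pdb A` prints the sequence of chain A, which can then be compared with the rows of the alignment after their gap characters are removed.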

The server implements an automatic preprocessing step that helps make the input data compatible with the server and attempts to correct minor errors. In particular, the task will be accepted even if the highest pairwise sequence identity between the submitted protein structure and any protein in the alignment is far below 100%. In rare cases the automatic preprocessing cannot be applied due to ambiguity of the user input. In these cases "Step 4: The Zebra bioinformatic analysis" will fail and the user will have to correct the input manually before submitting a new task.

The common errors and recommended solutions are described below.

ERROR: The file submitted is not a valid FASTA Alignment

The input multiple alignment should be in the FASTA format (not ClustalW, not Phylip, etc.). If you do not know the format of your alignment, submit it to Zebra and you will find out. If your alignment is in the wrong format, use a sequence format converter, e.g., sequenceconversion.bugaco.com.

An example of the FASTA ALIGNMENT format:

>protein_name_1
ISPQHIQYFMYHILLGLHVLHE--AG--VVHRS---------DLHPGNI
>protein_name_2
LEESHMQYFVYQILRGLKYLHS--AN--VAHRS------KNCDLKPANL
>protein_name_3
LSNDHICYFLYQILRGLKYIHS--AN--VLHRT------KNCDLKPSNL
>protein_name_4
LTDDHVQFLIYQILRGLKYIHSDFANRGIIHRSIYPGAAKNCDLKPSNL
>protein_name_5
LEKQFIQYFLYQILRGLKYVHSDFAGRAVVHRSLFPAGGKQCDLKPSNI
>protein_name_6
IDKQFIQYFLYQILKGLKYVHTDFAGRAVVHKTLFPAGGKQCDLKPPSI
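A file can be checked locally against the basic expectations before it is submitted. The sketch below is only an illustration and does not reproduce the server-side validation: the function name `check_alignment` is hypothetical, and the checks are limited to what this guide states (every sequence is introduced by a ">" header line, at least six sequences) plus the assumption that all rows of an alignment have the same length.

```shell
# check_alignment FILE : basic sanity checks for a FASTA alignment
check_alignment() {
    awk '
        /^>/ { n++; next }                       # a header line starts a new sequence
        n == 0 && NF { bad = 1 }                 # text found before the first header
        { gsub(/[ \t\r]/, ""); len[n] += length($0) }
        END {
            if (n == 0 || bad) {
                print "not FASTA: no \">\" header before the first sequence"
                exit 1
            }
            for (i = 2; i <= n; i++)
                if (len[i] != len[1]) {
                    printf "sequence %d has length %d, expected %d\n", i, len[i], len[1]
                    exit 1
                }
            if (n < 6) {
                printf "only %d sequences, at least 6 are required\n", n
                exit 1
            }
            printf "OK: %d sequences, %d columns\n", n, len[1]
        }
    ' "$1"
}
```

Running `check_alignment alignment.fasta` prints a single line and returns a non-zero exit status if one of the checks fails.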

ERROR: Reference pdb file "someproteinname.pdb" not found

This error is thrown if the 3D-mode was requested but no structure was uploaded by the user. Zebra then tries to access the PDB file that corresponds to the first protein in the multiple alignment file (here, protein someproteinname) and fails because there is no such file. The solution is simple. If you are running in the "QuickZebra+3D" mode, do not forget to upload the representative PDB file together with the multiple protein alignment. If you are running in the Manual mode, either leave the 3D-mode disabled (the default in the Manual mode) or enable the 3D-mode and upload the representative PDB file together with the multiple protein alignment.

ERROR: Subfamilies were not found

If the automatic subfamily classification fails, there are two main possible reasons:

The proteins in your alignment are too distant;
The proteins in your alignment are too close.

If you have requested the automatic prediction of functional subfamilies (enabled by default) and this automatic classification has failed, you will see the following message in the log:

WARNING: Subfamily classification into [2; 1000000000] subfamilies was not found
WARNING: Re-running search with new limits [2; 1000000001]

In this case check the log for a line like this:

INFO: Columns Valid:0 Gapped:2215 Invariant:3

Only the "Valid" columns can be used for the bioinformatic analysis and subfamily classification. A column is marked "Valid" if its content of gaps is below the threshold and it is not 100% conserved. Columns with a high content of gaps (above the selected threshold) are marked as "Gapped", and 100% conserved columns are marked as "Invariant". If the number of "Valid" columns is zero or very low, the bioinformatic analysis is likely to fail because too little information can be derived from your alignment.
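The same three counts can be estimated locally before a task is submitted. The sketch below follows the definitions given above but is not the Zebra implementation, so the numbers may differ from the log: the function name `count_columns` is hypothetical, both "-" and "." are assumed to be gap characters, and a column is assumed to be "Invariant" when all of its non-gap residues are identical.

```shell
# count_columns FILE [MAX_GAP_PERCENT] : classify alignment columns (default threshold 5%)
count_columns() {
    awk -v thr="${2:-5}" '
        /^>/ { n++; next }
        n > 0 { gsub(/[ \t\r]/, ""); seq[n] = seq[n] $0 }   # join wrapped sequence lines
        END {
            ncol = length(seq[1])
            for (c = 1; c <= ncol; c++) {
                gaps = 0; varied = 0; first = ""
                for (i = 1; i <= n; i++) {
                    ch = toupper(substr(seq[i], c, 1))
                    if (ch == "-" || ch == ".") { gaps++; continue }
                    if (first == "") { first = ch }
                    else if (ch != first) { varied = 1 }    # a second residue type was seen
                }
                if (100 * gaps / n > thr) { gapped++ }
                else if (!varied) { invariant++ }
                else { valid++ }
            }
            printf "Columns Valid:%d Gapped:%d Invariant:%d\n", valid, gapped, invariant
        }
    ' "$1"
}
```

The optional second argument is the gap threshold in percent, so `count_columns alignment.fasta 30` shows how many columns would remain available with a 30% threshold.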

The situation described above (i.e., Columns Valid:0, Gapped:2215, Invariant:3) means that your proteins are too distant and thus their structures are too different (i.e., most columns have a lot of gaps), and the common core which is shared by all of the homologs is very small and totally conserved (e.g., the catalytic triad). There are two things you could do:

Set the gap threshold to a higher value. You may, in principle, set this threshold to as high as 30-50%. This applies a less strict filter on the gap content, and as a result more columns will be available for the bioinformatic analysis;
You may also try to construct a new alignment with better coverage. Try our Mustguseal web-service to automatically construct large structure-guided sequence alignments of your protein families.

ERROR: Null columns preserved after the gaps filter
INFO: Alignment columns found with at most 5% of gaps: 0 (dismissed 948 columns with a higher gap content)

This error is qualitatively similar to the one discussed above, i.e. your alignment has too many gaps as a result of a poor coverage. Generally speaking, you have two options:

(1) Use a better alignment with a higher content of gap-free columns (recommended)

(2) Set the "Max content of gaps allowed on a column to be considered (%)" to a higher value

ERROR: Definition of functional subfamilies is not consistent with the alignment

This error occurs in the Manual mode when the user requests a custom subfamily classification to be implemented by the Zebra bioinformatic analysis but fails to provide a valid subfamily classification file.

A valid user-defined subfamily classification:

should assign each protein in the multiple alignment to a subfamily (i.e., the manual classification cannot be used to select a subset of proteins from the alignment; it must cover all proteins);
should assign each protein to only one subfamily;
should address each protein by its ID, i.e., its rank in the multiple alignment file (starting from 1: the ID of the first protein is "1", not "0");
should represent each subfamily by one line of protein IDs in the classification file;
may list the IDs within a subfamily line, as well as the lines in the classification file, in arbitrary order.

How do you obtain the protein IDs in the multiple alignment file? In Linux it is easy. Run the following shell command:

cat alignment.fasta | grep '^>' | nl

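This command prints the header line of every sequence preceded by its rank in the file, and this rank is the protein ID. If, for example, the last line of the output is numbered 148, then all IDs in the range 1-148 should be used in the subfamily classification file.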

Example. If your multiple alignment file contains six proteins (i.e., the IDs are #1, #2, #3, #4, #5, #6), each of the following three classification files is valid for Zebra. Each file classifies the sequences into two groups (the first three proteins and the last three proteins), with one group per line:

1 2 3
4 5 6

4 5 6
1 2 3

6 5 4
3 2 1

And the following classification files are invalid.

One protein (#6) is assigned to both groups:

6 5 4
6 3 2 1

One protein (#6) is not assigned to any group:

5 4
3 2 1

More proteins (#7-10) are assigned to groups than are present in the alignment:

6 5 4
3 2 1 7 8 9 10
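The rules for a valid classification file can be checked automatically before submission. The following sketch is an illustration only, not the check performed by the server: the function name `check_classification` is hypothetical, the number of proteins is taken from the number of ">" header lines in the alignment, and every non-empty line of the classification file is counted as one subfamily.

```shell
# check_classification ALIGNMENT CLASSIFICATION : every ID 1..N must occur exactly once
check_classification() {
    n=$(grep -c '^>' "$1")                       # number of proteins in the alignment
    awk -v n="$n" '
        { sub(/\r$/, "") }                       # tolerate Windows line endings
        { for (f = 1; f <= NF; f++) seen[$f]++ }
        NF { groups++ }                          # one non-empty line per subfamily
        END {
            for (id in seen) {
                if (id !~ /^[0-9]+$/ || id + 0 < 1 || id + 0 > n) {
                    printf "ID %s is not in the alignment (valid range 1-%d)\n", id, n
                    bad = 1
                } else if (seen[id] > 1) {
                    printf "ID %s is assigned %d times\n", id, seen[id]
                    bad = 1
                }
            }
            for (i = 1; i <= n; i++)
                if (!(i in seen)) {
                    printf "ID %d is not assigned to any subfamily\n", i
                    bad = 1
                }
            if (!bad) printf "OK: %d proteins in %d subfamilies\n", n, groups
            exit bad + 0
        }
    ' "$2"
}
```

`check_classification alignment.fasta subfamilies.txt` reports each problem on its own line and returns a non-zero exit status if the file is invalid; both file names are placeholders.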

ERROR: Internal error while calculating structure-based neighbour list (0 neighbours found)

This error usually happens when the representative protein, which was submitted as a PDB file to the server, has poor similarity (local or global) with the reference protein in the multiple sequence alignment.

The PDB file is expected to represent one protein from the sequence alignment, i.e., the two should have identical amino acid sequences. You should always aim to submit a representative PDB structure of a protein that is highly similar (i.e., >95%) to the reference protein sequence in the alignment. For your convenience, Zebra implements a preprocessing step that automatically selects the reference protein in the multiple alignment that has the highest pairwise sequence similarity with the representative protein in the PDB. All mismatching regions between the representative protein structure in the PDB and the reference protein sequence in the alignment are automatically removed. If the representative protein sequence has low similarity to the reference protein sequence, significant changes to both the structure and the sequence could be introduced at this step. The low sequence similarity can be global, i.e., the two proteins are remote homologs, or local, e.g., the sequence and structure belong to the same protein but some flexible loop is only partially resolved in the PDB. As a result of these changes, the automatically updated PDB structure can contain gaps in the backbone. If a residue in the structure has no neighbours within the 3D-mode cut-off radius (4 Å by default), the bioinformatic analysis fails.

There are three ways to resolve this problem:

(1) Quick and easy workaround - can make it work but is likely to introduce bias and reduce the scientific value of the results, should be used for information purposes only. Open the log file and go to the "Step 3: Preprocessing of the Input Data" section. Download the PDB structure of the representative protein after preprocessing (you can find the link at the end of the section). Have a look at the annotated pairwise alignment between the "pdb" and the "best_match". Find the "lonely" residues in the "pdb" and delete them from the downloaded PDB file. E.g., >pdb ANKGGPSEGA means that only the P was preserved in the PDB file after preprocessing and all other residues were dismissed. This Proline is likely to cause the error when using the 3D-mode because it seems to be too far from other residues in the 3D space. Check that the Proline is also "lonely" in the structure by looking at the corresponding PDB file and then remove it (delete the coordinates of the corresponding atoms) from the PDB file. Submit a new task to the server;

(2) Find a better representative protein in the PDB database. BLAST your multiple alignment versus the sequences of proteins in the PDB database and select a better match than your current representative protein;

(3) Build a 3D model of the representative protein based on the available structural data. You can reconstruct the missing loops in the globule or predict the entire structure using homology modeling. The highly capable Modeller software can do both. If you do not know how to use Modeller to build a model of the representative protein for Zebra/pocketZebra, you can contact us and we will e-mail you the template scripts.

ERROR: Neighbour list for acid with alignment position XXX contains pdb position YYY not presented in the alignment

This error means that there is something wrong with the format of your PDB file. For example, the coordinates of a ligand or a non-standard amino acid residue are provided in the "ATOM" field instead of "HETATM", or an amino acid record in your PDB file is incomplete, e.g., the coordinates of the 'CA' atom of a certain residue are missing. This problem with your PDB file has to be corrected manually.

First, generally check your PDB file for ligands and non-standard amino acid residues - these must be provided in the "HETATM" field. In particular, check the "Step 4" log file for warnings like this:

INFO: Non-canonical acid detected A SIA 501

Usually, such cases are dismissed at the PDB validation/preprocessing step, unless they do not contain a "CA" atom type. A quick workaround would be to delete the corresponding item from the PDB or to make sure that its coordinates are provided in the "HETATM" field. If this did not help, proceed as further discussed below.
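Such items can also be located in the PDB file itself. The sketch below is not part of Zebra and the function name `nonstandard_in_atom` is hypothetical. It assumes the fixed-column PDB layout and lists every residue found in the "ATOM" field whose name is not one of the 20 standard amino acids, as chain ID, residue name and residue ID.

```shell
# nonstandard_in_atom FILE : residues in ATOM records that are not standard amino acids
nonstandard_in_atom() {
    awk '
        BEGIN {
            split("ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL", t, " ")
            for (i in t) std[t[i]] = 1
        }
        substr($0, 1, 6) == "ATOM  " {
            res = substr($0, 18, 3)
            key = substr($0, 22, 1) " " res " " (substr($0, 23, 4) + 0)
            if (!(res in std) && !(key in shown)) {
                shown[key] = 1                   # report each residue only once
                print key
            }
        }
    ' "$1"
}
```

Each line of output, e.g. `A SIA 501`, names an item that should either be deleted from the file or moved to the "HETATM" field.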

Open the PDB file in a 3D structure viewer (e.g., PyMOL) or in a text editor and find the YYY'th amino acid, i.e., the YYY'th amino acid record in the file (the count starts from 1). Please note that the residue ID of the YYY'th amino acid will differ from "YYY" if the residue numbering in the PDB does not start from 1 and/or some residues are missing in the backbone. Check whether the record for the YYY'th amino acid is incomplete (if it is not, you might have lost count), e.g.:

ATOM    675  N   ALA A  84      30.430   1.654  28.087  1.00 11.02           N
ATOM    676  CA  ALA A  84      31.779   1.682  27.569  1.00 13.03           C
ATOM    677  C   ALA A  84      32.111   2.986  26.814  1.00 14.26           C
ATOM    678  O   ALA A  84      33.266   3.161  26.456  1.00 13.99           O
ATOM    679  CB  ALA A  84      32.036   0.525  26.603  1.00 16.45           C
ATOM    680  N   LEU A  85      31.167   3.874  26.582  1.00 14.45           N
ATOM    688  N   SER A  86      33.067   6.758  25.759  1.00 15.08           N
ATOM    689  CA  SER A  86      33.859   7.861  26.275  1.00 16.99           C
ATOM    690  C   SER A  86      32.871   8.932  26.675  1.00 18.70           C
ATOM    691  O   SER A  86      31.712   9.057  26.205  1.00 16.46           O
ATOM    692  CB  SER A  86      34.744   8.419  25.138  1.00 17.40           C
ATOM    693  OG  SER A  86      33.883   9.078  24.215  1.00 17.15           O

In the example above the record for LEU85 is incomplete: the coordinates of most atoms (including the 'CA' atom) are missing. Once you have found the residue in question, you can either (1) remove (delete) the entire residue from the PDB file and submit a new task, or (2) add the coordinates of the missing atoms to the PDB file (e.g., with the help of the Modeller molecular modeling package) and submit a new task. In this case you may need to rebuild the entire multiple alignment to include this residue.
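Counting amino acid records by hand is error-prone in a large file, so the search for incomplete residues can be scripted. The sketch below is an illustration under stated assumptions and not the check that Zebra performs: the function name `incomplete_residues` is hypothetical, the file is assumed to follow the fixed-column PDB layout, and a residue is reported as incomplete when any of the backbone atoms N, CA, C or O is absent from its "ATOM" records. The number at the start of each output line is the rank of the residue in the file, counted from 1.

```shell
# incomplete_residues FILE : residues in ATOM records that lack N, CA, C or O
incomplete_residues() {
    awk '
        substr($0, 1, 6) == "ATOM  " {
            key = substr($0, 22, 1) " " substr($0, 18, 3) " " (substr($0, 23, 4) + 0)
            if (!(key in idx)) { idx[key] = ++count; order[count] = key }
            name = substr($0, 13, 4)
            gsub(/ /, "", name)                  # atom name without padding
            have[key, name] = 1
        }
        END {
            split("N CA C O", need, " ")
            for (k = 1; k <= count; k++) {
                missing = ""
                for (j = 1; j <= 4; j++)
                    if (!((order[k], need[j]) in have)) missing = missing " " need[j]
                if (missing != "") printf "%d: %s is missing%s\n", k, order[k], missing
            }
        }
    ' "$1"
}
```

For a file that contains the records of the example above, residue LEU 85 of chain A is reported as missing CA, C and O.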