Mustguseal Performance



Mustguseal limits the size of the core structural alignment to at most 16-150 proteins, depending on the input mode and the database selected for the sequence similarity search, in order to increase the overall performance of this service and provide it to as many people as possible. The set of 16-150 proteins in the core structural alignment can represent 16-150 structurally and functionally diverse protein families. As a result, multiple alignments of tens of thousands of proteins representing large superfamilies can be constructed using this public web-server in Modes 1, 2, or 3.



The following settings are currently implemented to increase the overall performance of this service and provide it to as many people as possible:


General notes on the Mustguseal Performance

  • The running time of a Mustguseal task and the size of a final alignment will depend on the particular input, parameter setup, and availability of data in the PDB, Swiss-Prot, and TrEMBL databases;
  • Mode 1 is the default, fully automated, and easiest way to obtain an alignment: the user submits only the PDB and chain IDs of a query protein. It also takes longer to complete than the other modes because all steps of the Mustguseal protocol are executed to collect and align the related sequences and structures from the selected databases;
  • The structure similarity search in Mode 1 runs faster when the requested structural similarity threshold is high, i.e., within the range 70-100%; Step 1 takes longer when the percentage of secondary structure equivalences is set to 30-40%;
  • All pairwise comparisons computed during a structure similarity search in Mode 1 are stored in a PostgreSQL-managed database for re-use in subsequent searches. When a new task is submitted in Mode 1 with the same query structure, Step 1 (i.e., the structure similarity search) takes only a few seconds because the results are restored from the database, which is hosted on a very fast solid-state drive, instead of being re-computed. This lets the user refine the alignment by submitting a new task with the same query but different parameters and get the results significantly faster;
  • The user can run sequence similarity searches either in the Swiss-Prot database alone or in the much larger combined Swiss-Prot+TrEMBL dataset;
  • The redundancy filter threshold has a direct impact on the speed of a sequence similarity search (a value below 80% is the fastest option, and 100% is the slowest option) because the server uses pre-calculated non-redundant sets of the Swiss-Prot and TrEMBL databases, and the nr80 database set is smaller than the nr100 database set (see The Parameters above);
  • A task submitted in Mode 2 runs on average at least twice as fast as in Mode 1 because the time-consuming Steps 1 and 2 (i.e., the structure similarity search and construction of the structural alignment) are skipped;
  • A task submitted in Mode 3 takes from several seconds to several minutes;
  • Mustguseal is a complex bioinformatic pipeline optimized for heterogeneous computing. The computer science of the Mustguseal is discussed in a separate publication [Suplatov D. et al. (2019) High-Performance Hybrid Computing for Bioinformatic Analysis of Protein Superfamilies. In: Voevodin V., Sobolev S. (eds) Supercomputing. RuSCDays 2019. Communications in Computer and Information Science, vol 1129. Springer, Cham DOI: 10.1007/978-3-030-36592-9_21].

 


Performance in Mode 1

Selected sequence similarity search database | The maximum size of the core structural alignment | What happens if you exceed the limit?
Swiss-Prot | 32 (default) or 16 proteins | The first 32/16 proteins most similar to the query will be automatically selected
Swiss-Prot+TrEMBL | 16 (default) or 32 proteins | The first 16/32 proteins most similar to the query will be automatically selected


In Mode 1 the proteins for the core structural alignment are selected automatically based on the results of the structure similarity search. At this step the Mustguseal protocol selects a representative set of at most 16/32 protein structures (depending on the database selected for the subsequent sequence similarity searches) by clustering their corresponding sequences at different pairwise similarity thresholds in a range from 95% to 40%. The selected representative structures are then aligned by structural superimposition to create the core structural alignment. If even the smallest set of representative structures (i.e., the one produced by clustering at the 40% threshold) exceeds 16/32, the first 16/32 proteins most similar to the query are selected automatically and a warning message appears in the log. Each representative protein is then used as a query for a sequence similarity search. A search in the much larger Swiss-Prot+TrEMBL database is significantly slower, which is why the limit on the number of proteins in the core structural alignment depends on the database selected for the sequence similarity search.
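The selection logic described above can be sketched as follows. This is only an illustration under stated assumptions, not the server's actual code: the function names `select_representatives` and `toy_cluster` are hypothetical, the intermediate threshold steps between 95% and 40% are assumed (the text gives only the range), and it is assumed that thresholds are tried from the highest to the lowest and that the first set fitting the limit is used.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed step values; the documentation only states the range 95% to 40%.
THRESHOLDS = (95, 90, 80, 70, 60, 50, 40)


def select_representatives(
    hits: Sequence[str],
    cluster: Callable[[Sequence[str], int], List[str]],
    limit: int = 32,
    thresholds: Sequence[int] = THRESHOLDS,
) -> Tuple[List[str], bool]:
    """Pick at most `limit` representatives from `hits`.

    `hits` must be sorted by decreasing similarity to the query.
    `cluster(hits, t)` returns one representative per cluster at
    threshold t, preserving the order of `hits`.
    Returns (selected, truncated); `truncated` is True when even the
    lowest threshold gave too many clusters and the list was cut.
    """
    reps = list(hits)
    for t in thresholds:
        reps = list(cluster(hits, t))
        if len(reps) <= limit:
            return reps, False
    # Fallback from the text: keep the first `limit` proteins most
    # similar to the query (the server would log a warning here).
    return reps[:limit], True


def toy_cluster(hits: Sequence[str], threshold: int) -> List[str]:
    """Toy stand-in for sequence clustering (hypothetical): lower
    thresholds compare shorter ID prefixes, so more hits are merged."""
    prefix_len = 3 if threshold >= 90 else 2 if threshold >= 60 else 1
    seen, reps = set(), []
    for hit in hits:
        key = hit[:prefix_len]
        if key not in seen:
            seen.add(key)
            reps.append(hit)
    return reps


hits = ["AAA", "AAB", "ABA", "BAA", "BBA", "CAA"]
print(select_representatives(hits, toy_cluster, limit=4))
# (['AAA', 'BAA', 'CAA'], False)
print(select_representatives(hits, toy_cluster, limit=2))
# (['AAA', 'BAA'], True)
```

In the second call even the 40% step leaves three clusters, so the list is cut to the two hits most similar to the query, which corresponds to the case where the warning appears in the log.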

Since June 2020, the limits for the TrEMBL database have been raised. An option entitled "Select at most representative proteins" is now available to set the maximum size of the core structural alignment to either 16 or 32 when running in Mode 1. A brief explanation of this option is available here. Nevertheless, we urge users to be reasonable: adding another 16 representative proteins to your core 3D-structural alignment can add another ~16'000 protein sequences to your final alignment when using TrEMBL (i.e., if the "Maximum number of sequences to collect in each subsearch" parameter is set to "1000").
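The ~16'000 figure above is an upper bound obtained by multiplying the number of extra representatives by the per-subsearch cap. A minimal sketch of that arithmetic (the function name is hypothetical, and the actual number collected may be lower after filtering):

```python
def max_added_sequences(extra_representatives: int,
                        max_per_subsearch: int = 1000) -> int:
    """Upper bound on the sequences added to the final alignment when
    each extra representative triggers one subsearch that collects at
    most `max_per_subsearch` sequences."""
    return extra_representatives * max_per_subsearch


print(max_added_sequences(16))  # 16000
print(max_added_sequences(32))  # 32000
```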

 

Performance in Mode 2

Selected sequence similarity search database | The maximum size of the core structural alignment | What happens if you exceed the limit?
Swiss-Prot | 64 proteins | The task will be rejected
Swiss-Prot+TrEMBL | 64 proteins (32 before June 2020) | The task will be rejected


In Mode 2 the core structural alignment is submitted by the user, and each protein in that alignment is used as a query for a sequence similarity search. A search in the much larger Swiss-Prot+TrEMBL database is significantly slower. To keep the service responsive, the number of Swiss-Prot scans is limited to at most 64 per task, and the number of Swiss-Prot+TrEMBL scans is likewise limited to at most 64 per task (32 before June 2020). If the user-submitted core structural alignment contains more proteins, the task will be rejected. In this case you can split a large core structural alignment into several text files (i.e., copy-paste the names, sequences, and gaps of the first 64 proteins into the first text file, then those of the next 64 proteins into the second text file, etc.), submit each file as a separate task in Mode 2 to collect the sequence alignment blocks for each representative protein, and then submit the entire core structural alignment and all sequence alignment blocks in Mode 3.
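The manual copy-paste splitting described above can also be scripted. The sketch below assumes the core structural alignment is stored as FASTA text (a ">" name line followed by the gapped sequence); the function names are hypothetical, and names, sequences, and gaps are copied unchanged.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (name, gapped_sequence) pairs."""
    records: List[Tuple[str, str]] = []
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            name, parts = line[1:], []
        elif name is not None:
            parts.append(line)  # sequence may span several lines
    if name is not None:
        records.append((name, "".join(parts)))
    return records


def split_alignment(text: str, chunk_size: int = 64) -> List[str]:
    """Split an alignment into FASTA chunks of at most `chunk_size`
    proteins each, preserving order and gap characters."""
    records = parse_fasta(text)
    chunks = []
    for start in range(0, len(records), chunk_size):
        block = records[start:start + chunk_size]
        chunks.append("".join(f">{n}\n{s}\n" for n, s in block))
    return chunks


# Toy alignment of 150 proteins -> three files of 64, 64, and 22.
alignment = "".join(f">protein_{i}\nMK--LV\n" for i in range(150))
for i, chunk in enumerate(split_alignment(alignment), start=1):
    print(f"file {i}: {chunk.count('>')} proteins")
```

Each chunk would then be saved to its own text file and submitted as a separate Mode 2 task, while the unsplit alignment is kept for the final Mode 3 submission.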

Since June 2020, the limits for the TrEMBL database have been raised. Nevertheless, we urge users to be reasonable: adding another 32 representative proteins to your core 3D-structural alignment can add another ~32'000 protein sequences to your final alignment when using TrEMBL (i.e., if the "Maximum number of sequences to collect in each subsearch" parameter is set to "1000").

 

Performance in Mode 3

MAPU enabled? | The maximum size of the core structural alignment | The maximum size of each sequence alignment block | The maximum total size of the final alignment
No (default) | 150 proteins | 500 proteins | 15'000 proteins
Yes | 150 proteins | 5'000 proteins | 100'000 proteins (50'000 before June 2020)


Submissions exceeding these thresholds will be rejected. These limitations in Mode 3 were introduced to prevent intentional abuse of the service. Building alignments larger than 15'000 proteins seems impractical, as they are likely to contain redundant information and would be computationally hard to analyze. However, if you need to build larger alignments for a particular purpose, you can enable the MAPU tool. This originally increased the upper limits for sequences per alignment block and in total to 5'000 and 50'000, respectively. Please see this page for more information on how to enable the use of the MAPU tool. Since June 2020, the limit on the total alignment size in Mode 3 has been raised further to 100'000 proteins when using MAPU.
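A submission can be checked against these Mode 3 limits before uploading. The sketch below is a local pre-check only, not the server's validation code: the names are hypothetical, the numbers are the current limits stated above, and the total alignment size is assumed to be approximated by the sum of the block sizes.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Mode3Limits:
    core: int   # max proteins in the core structural alignment
    block: int  # max proteins in each sequence alignment block
    total: int  # max proteins in the final alignment


# Keyed by "MAPU enabled?"; values taken from the text above.
MODE3_LIMITS = {
    False: Mode3Limits(core=150, block=500, total=15_000),
    True: Mode3Limits(core=150, block=5_000, total=100_000),
}


def check_mode3(core_size: int, block_sizes: Sequence[int],
                mapu: bool = False) -> List[str]:
    """Return a list of limit violations (empty if none)."""
    limits = MODE3_LIMITS[mapu]
    problems = []
    if core_size > limits.core:
        problems.append(f"core alignment has {core_size} proteins "
                        f"(limit {limits.core})")
    for i, size in enumerate(block_sizes, start=1):
        if size > limits.block:
            problems.append(f"block {i} has {size} proteins "
                            f"(limit {limits.block})")
    # Assumption: total size is approximated by the sum of all blocks.
    total = sum(block_sizes)
    if total > limits.total:
        problems.append(f"final alignment has about {total} proteins "
                        f"(limit {limits.total})")
    return problems


# 40 blocks of 500 proteins: too large without MAPU, fine with it.
print(check_mode3(40, [500] * 40))
print(check_mode3(40, [500] * 40, mapu=True))
```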