Large-scale structure-informed multiple sequence alignment of proteins with SIMSApiper

  • Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, 1050, Belgium.

Summary

This summary is machine-generated.

SIMSApiper is a Nextflow pipeline for creating reliable, structure-informed multiple sequence alignments (MSAs) of thousands of protein sequences. It speeds up alignment through parallelization over sequence subsets and uses the positions of conserved secondary structure elements to reduce the number of gaps.

Area Of Science

  • Bioinformatics
  • Computational Biology
  • Structural Biology

Background

  • Multiple sequence alignment (MSA) is crucial for understanding protein function and evolution.
  • Existing structure-based alignment methods can be computationally intensive and slow for large datasets.
  • Integrating structural information can improve MSA accuracy and reliability.

Purpose Of The Study

  • To develop a fast and reliable pipeline for structure-informed multiple sequence alignment.
  • To enable the alignment of thousands of protein sequences efficiently.
  • To reduce the number of gaps in MSAs by leveraging structural data.

Main Methods

  • Developed SIMSApiper, a Nextflow pipeline utilizing Python3 and Bash.
  • Incorporated user-provided or automatically retrieved structural information.
  • Implemented parallelization strategies based on sequence identity subsets.
  • Utilized conserved secondary structure elements to minimize gaps.
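
The idea of splitting the input into sequence identity-based subsets can be illustrated with a minimal Python sketch. It is not SIMSApiper's implementation: the function names, the 0.7 threshold, and the greedy grouping are assumptions made for illustration, and `difflib.SequenceMatcher` is only a crude stand-in for a real alignment-based identity score.

```python
from difflib import SequenceMatcher


def approx_identity(a: str, b: str) -> float:
    """Crude similarity proxy in [0, 1]; NOT a true alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def split_by_identity(seqs: dict[str, str], threshold: float = 0.7) -> list[list[str]]:
    """Greedily group sequence IDs (hypothetical helper).

    Each sequence joins the first subset whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new subset.
    """
    subsets: list[list[str]] = []
    for seq_id, seq in seqs.items():
        for subset in subsets:
            if approx_identity(seq, seqs[subset[0]]) >= threshold:
                subset.append(seq_id)
                break
        else:
            # No existing subset was similar enough.
            subsets.append([seq_id])
    return subsets


# Toy sequences, invented for this example.
toy = {
    "a1": "MKTAYIAKQRQISFVKSHFSRQ",
    "a2": "MKTAYIAKQRQISFVKSHFSRE",
    "b1": "GSHMLEDPVDAFWWNNCCGGHH",
}
subsets = split_by_identity(toy)
```

Because the subsets share no sequences, each one could be aligned as an independent job, which is what makes this kind of split suitable for parallel execution.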

Main Results

  • SIMSApiper generates reliable, structure-informed MSAs.
  • The pipeline runs faster than standard structure-based alignment methods.
  • Parallelization over sequence identity-based subsets yields a substantial speed-up.
  • Reduced the number of gaps in alignments by effectively using secondary structure information.
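
One simple way to quantify a reduction in gaps is the fraction of gap characters in an alignment. The sketch below is an illustrative metric only, not the measure used by the SIMSApiper authors; the function name `gap_fraction` and both toy alignments are assumptions.

```python
def gap_fraction(alignment: list[str], gap: str = "-") -> float:
    """Return the share of cells in a rectangular alignment that are gaps."""
    if not alignment:
        raise ValueError("alignment is empty")
    width = len(alignment[0])
    if width == 0 or any(len(row) != width for row in alignment):
        raise ValueError("all rows must have the same non-zero length")
    gaps = sum(row.count(gap) for row in alignment)
    return gaps / (width * len(alignment))


# Toy alignments of the same three sequences, invented for this example.
gappy = ["MKT--AY", "MK-T-AY", "MK--TAY"]
compact = ["MKTAY", "MKTAY", "MKTAY"]

before = gap_fraction(gappy)
after = gap_fraction(compact)
```

Comparing `before` and `after` for the same input sequences gives a rough indication of how much a post-processing step has compacted the alignment.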

Conclusions

  • SIMSApiper offers a highly efficient and accurate solution for large-scale protein sequence alignment.
  • The pipeline's ability to integrate structural data enhances MSA quality.
  • Its speed and reliability make it a valuable tool for bioinformatics research.