ColabFold

ColabFold is an innovative tool for protein structure prediction based on Google DeepMind’s AlphaFold2. Designed to enhance accessibility for researchers, ColabFold leverages deep learning models trained on multiple sequence alignments (MSA) and experimentally verified protein structures to accurately predict novel protein structures from amino acid sequences.

By integrating the rapid homology search capabilities of MMseqs2 with the powerful prediction algorithms of AlphaFold2, ColabFold accelerates the prediction of protein structures and complexes.

Protein structure inference is a crucial tool for various applications, including the comparison of candidates in protein-protein interaction studies, prediction of domain-specific secondary and tertiary structures, and visualization of proteins lacking experimental structures. For more details, visit the ColabFold GitHub page.

Note

AlphaFold2 and ColabFold do not support multiple GPUs; each protein is modeled on a single GPU.

Runtime

ColabFold is deployed in four different modes to suit various research needs:

  1. Lab:

    • Runs AlphaFold2 using MMseqs2 in an interactive JupyterLab session.

    • Local ColabFold database stored in /colabfold/databases.

    • AlphaFold2 weights stored in /colabfold/params.

    • Tutorial notebooks available in /colabfold/notebooks.

  2. Search:

    • Runs a local MMseqs2 installation for MSA generation. Recommended for large projects (>2000 amino acids); smaller one-time projects can skip this step.

    • Databases for MMseqs2 search available in /colabfold/databases.

    • Generates MSA for large scale structure/complex predictions.

    • Inputs .fasta file and outputs .a3m file.

    Hint

    CPU intensive process. Use a machine type with CPU and RAM resources equal to or greater than that of u1-standard-16.

  3. Prediction:

    • Runs structure inference using the MSA result file from ColabFold: Search.

    • Can use public MMseqs2 server for MSA if supplied with one or more .fasta files.

    • Takes input directory containing .fasta or .a3m files:

      • For single query .fasta files, generates structure predictions using MMseqs2 search results from the MSA server database.

      • For .a3m files, uses supplied MMseqs2 search results for predictions.

    • Outputs ranked model structure files (.pdb) and confidence assessment figures (.png) showing PAE, pLDDT, and MSA coverage.

    • Local PDB database stored in /colabfold/databases/pbd/divided.

    Hint

    The structure inference process requires GPU resources to finish in a reasonable time.

  4. Split MSA: Splits a single search into one .a3m file per MSA.
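The way Prediction treats its input directory can be sketched in a few lines of Python. This is only an illustration of the rule described above (.fasta files still need an MSA, .a3m files already carry one); the function name plan_inputs and the dictionary keys are hypothetical and are not part of ColabFold.

```python
import tempfile
from pathlib import Path


def plan_inputs(input_dir):
    """Group the files in input_dir by how Prediction would treat them.

    Hypothetical helper for illustration only, not ColabFold code.
    """
    plan = {"needs_msa": [], "has_msa": [], "other": []}
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix == ".fasta":
            # No alignment yet: the MSA has to come from an MMseqs2 search.
            plan["needs_msa"].append(path.name)
        elif suffix == ".a3m":
            # Alignment supplied, e.g. produced earlier by ColabFold: Search.
            plan["has_msa"].append(path.name)
        else:
            plan["other"].append(path.name)
    return plan


# Small demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ("query.fasta", "complex.a3m", "notes.txt"):
        (Path(tmp) / filename).touch()
    print(plan_inputs(tmp))
```

Checking the input directory this way before starting a GPU job makes it easy to spot queries that would still trigger an MSA search.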

Workflow Example

An example workflow involves recreating parts of the structure shown in the PDB entry 6WSL, which includes alpha-tubulin (TUBA1A), beta-tubulin (TUBB3), and tubulinyl-Tyr carboxypeptidase (VASH1). Using UniProt.org, the canonical protein sequences are located and combined to construct a FASTA file for a single complex query. The FASTA file below illustrates the query format with a monomer query and a two-chain complex query:

>TRY1_monomer
MNPLLILTFVAAALAAPFDDDDKIVGGYNCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAGHCYKSRIQVRLGEHNIEVLEGNEQFINAAKIIRHPQYDRKTLNNDIMLIKLSSRAVINARVSTISLPTAPPATGTKCLISGWGNTASSGADYPDELQCLDAPVLSQAKCEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVCNGQLQGVVSWGDGCAQKNKPGVYTKVYNYVKWIKNTIAANS
>TRY1_PABPN1_complex
MNPLLILTFVAAALAAPFDDDDKIVGGYNCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAGHCYKSRIQVRLGEHNIEVLEGNEQFINAAKIIRHPQYDRKTLNNDIMLIKLSSRAVINARVSTISLPTAPPATGTKCLISGWGNTASSGADYPDELQCLDAPVLSQAKCEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVCNGQLQGVVSWGDGCAQKNKPGVYTKVYNYVKWIKNTIAANS:MAAAAAAAAAAGAAGGRGSGPGRRRHLVPGAGGEAGEGAPGGAGDYGNGLESEELEPEELLLEPEPEPEPEEEPPRPRAPPGAPGPGPGSGAPGSQEEEEEPGLVEGDPGDGAIEDPELEAIKARVREMEEEAEKLKELQNEVEKQMNMSPPPGNAGPVIMSIEEKMEADARSIYVGNVDYGATAEELEAHFHGCGSVNRVTILCDKFSGHPKGFAYIEFSDKESVRTSLALDESLFRGRQIKVIPKRTNRPGISTTDRGFPRARYRARTTNYNSSRSRFYSGFNSRPRGRVYRGRARATSWYSPY

In this file, each query is described using two lines:

  • > followed by a query name containing no spaces, e.g., TRY1_monomer.

  • One or more amino acid sequences. Complex queries separate chains using a colon symbol, e.g., …ANS:MAA….
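The query format described above can be checked with a short Python sketch before submitting a job. The function name parse_queries and the toy sequences are made up for illustration; the sketch assumes only the two rules stated here (a > header line, and colons between the sequences of a complex).

```python
def parse_queries(fasta_text):
    """Return a list of (name, chains) tuples parsed from FASTA text.

    Illustrative helper, not part of ColabFold. A sequence may span
    several lines; the chains of a complex are split on ':'.
    """
    queries = []
    name, parts = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous query, if there was one.
            if name is not None:
                queries.append((name, "".join(parts).split(":")))
            name, parts = line[1:], []
        else:
            parts.append(line)
    if name is not None:
        queries.append((name, "".join(parts).split(":")))
    return queries


# Toy input: shortened, made-up fragments rather than real full-length sequences.
example = ">monomer_demo\nMNPLLILTF\n>complex_demo\nMNPLLILTF:MAAAAAAAAAAG\n"
for name, chains in parse_queries(example):
    total = sum(len(chain) for chain in chains)
    print(f"{name}: {len(chains)} chain(s), {total} residues in total")
```

The total residue count is the number to compare against the sizes quoted elsewhere on this page, such as the >2000 amino acid guideline for running Search locally.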

Using default settings, the FASTA file containing the queries for TUBA1A, TUBB3, and VASH1 is submitted to ColabFold: Search, which searches the query sequences against the local MMseqs2 databases and stores the resulting MSA in an alignment file (.a3m). This file is returned once the run is complete.

Next, ColabFold: Prediction is run with default settings, using the .a3m alignment file as input for structure inference. Upon completion, the output folder contains .pdb files of the predicted structures for each model used in the run (by default, 5), along with figures assessing the confidence of the predictions.
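For readers who want to reproduce the same two steps from a terminal, for example inside the Lab mode, the sketch below builds the corresponding command lines in Python. It assumes that the upstream ColabFold command-line tools colabfold_search and colabfold_batch are available on the PATH and are called with their positional arguments only; the file and folder names are hypothetical, and this page does not state that the Search and Prediction apps invoke the tools exactly this way.

```python
import shlex


def search_command(fasta_file, msa_dir, db_dir="/colabfold/databases"):
    """Command for the MSA step: query FASTA, database folder, output folder."""
    return ["colabfold_search", fasta_file, db_dir, msa_dir]


def predict_command(input_path, output_dir):
    """Command for structure inference from .a3m (or .fasta) input."""
    return ["colabfold_batch", input_path, output_dir]


# Hypothetical file and folder names, used only for illustration.
steps = [
    search_command("queries.fasta", "msas"),
    predict_command("msas", "predictions"),
]
for cmd in steps:
    # Printed rather than executed; subprocess.run(cmd, check=True) would run it.
    print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell-quoting problems if a path contains spaces.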

Benchmarks

Database searching

ColabFold: Search

Machine type      Protein monomer                   Protein complex
                  (247 amino acids, w/ templates)   (752 amino acids, w/ templates)

u1-standard-16    ~50 min                           ~40 min
u1-fat-16         ~30 min                           ~50 min
u1-standard-32    ~30 min                           ~35 min
u1-fat-32         ~15 min                           ~40 min
u1-standard-64    ~30 min                           ~40 min
u1-fat-64         ~20 min                           ~30 min

Benchmarks were conducted using default settings.

Structure prediction

ColabFold: Prediction

Machine type   Protein monomer     Protein complex                   Protein complex
               (247 amino acids)   (752 amino acids, w/ templates)   (2977 amino acids)

u1-gpu-1       ~2 min              ~20 min                           >24 h
u2-gpu-1       ~2 min              ~6 min                            ~5 h
u3-gpu-1       ~2 min              ~4 min                            ~1 h 30 min

Benchmarks were conducted using default settings and alignment/template files generated using ColabFold: Search.
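To compare machine types at a glance, the approximate timings can be turned into relative speedups. The sketch below does this for the 752 amino acid complex; the numbers are the rounded values from the structure prediction benchmark, so the ratios are rough estimates, and the function name speedups is made up for illustration.

```python
# Approximate Prediction run times in minutes for the 752 amino acid complex,
# taken from the benchmark above (the "~" values are treated as exact here).
complex_752_minutes = {
    "u1-gpu-1": 20,
    "u2-gpu-1": 6,
    "u3-gpu-1": 4,
}


def speedups(timings, baseline):
    """Return how many times faster each machine is than the baseline."""
    base = timings[baseline]
    return {machine: round(base / minutes, 1) for machine, minutes in timings.items()}


for machine, factor in speedups(complex_752_minutes, "u1-gpu-1").items():
    print(f"{machine}: {factor}x relative to u1-gpu-1")
```

The same calculation does not apply to the 247 amino acid monomer, where all three machine types finish in about 2 minutes, or to the u1-gpu-1 entry for the 2977 amino acid complex, which is only known as a lower bound (>24 h).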