ColabFold¶
ColabFold is an innovative tool for protein structure prediction based on Google DeepMind’s AlphaFold2. Designed to enhance accessibility for researchers, ColabFold leverages deep learning models trained on multiple sequence alignments (MSA) and experimentally verified protein structures to accurately predict novel protein structures from amino acid sequences.
By integrating the rapid homology search capabilities of MMseqs2 with the powerful prediction algorithms of AlphaFold2, ColabFold accelerates the prediction of protein structures and complexes.
Protein structure inference is a crucial tool for various applications, including the comparison of candidates in protein-protein interaction studies, prediction of domain-specific secondary and tertiary structures, and visualization of proteins lacking experimental structures. For more details, visit the ColabFold GitHub page.
Note
AlphaFold2 and ColabFold do not support multi-GPU inference; each protein is modeled on a single GPU.
Runtime¶
ColabFold is deployed in four different modes to suit various research needs:
Lab:
- Runs AlphaFold2 using MMseqs2 in an interactive JupyterLab session.
- Local ColabFold database stored in /colabfold/databases.
- AlphaFold2 weights stored in /colabfold/params.
- Tutorial notebooks available in /colabfold/notebooks.
Search:
- Runs a local MMseqs2 installation for MSA generation. Recommended for big projects (>2000 amino acids); for smaller one-time projects this step can be skipped.
- Databases for the MMseqs2 search are available in /colabfold/databases.
- Generates MSAs for large-scale structure/complex predictions.
- Takes a .fasta file as input and outputs an .a3m file.
Hint
This is a CPU-intensive process. Use a machine type with CPU and RAM resources equal to or greater than those of u1-standard-16.
Prediction:
- Runs structure inference using the MSA result file from ColabFold: Search.
- Can use the public MMseqs2 server for MSA generation if supplied with one or more .fasta files.
- Takes an input directory containing .fasta or .a3m files:
  - For single-query .fasta files, generates structure predictions using MMseqs2 search results from the MSA server database.
  - For .a3m files, uses the supplied MMseqs2 search results for predictions.
- Outputs ranked model structure files (.pdb) and confidence assessment figures (.png) showing PAE, pLDDT, and MSA coverage.
- Local PDB database stored in /colabfold/databases/pbd/divided.
Hint
Structure inference requires GPU resources to finish in a reasonable time.
Split MSA:
- Splits a single search result into one .a3m file per MSA.
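As a rough aid for choosing between the public MSA server and the Search mode, the sketch below totals the amino acids in a FASTA file. It assumes the ">2000 amino acids" guideline refers to the summed length of all queries, which the text above does not state explicitly; the function names and the constant are hypothetical and not part of ColabFold.

```python
# Assumption: the guideline is read as the summed length of all queries.
LOCAL_SEARCH_THRESHOLD = 2000


def total_residues(fasta_text: str) -> int:
    """Count amino acids across all queries, ignoring headers and ':' separators."""
    total = 0
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if line and not line.startswith(">"):
            total += len(line.replace(":", ""))
    return total


def suggest_local_search(fasta_text: str, threshold: int = LOCAL_SEARCH_THRESHOLD) -> bool:
    """Return True when the project exceeds the (assumed) size guideline."""
    return total_residues(fasta_text) > threshold


# Toy input with shortened sequences, for illustration only.
toy = ">q1\nMNPLL\n>q2\nMNP:MAA\n"
print(total_residues(toy))        # 5 + 6 = 11
print(suggest_local_search(toy))  # False
```

The threshold is exposed as a parameter so it can be adjusted if the guideline is interpreted per query rather than per project.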
Workflow Example¶
An example workflow involves recreating parts of the structure shown in the PDB entry 6WSL, which includes alpha-tubulin (TUBA1A), beta-tubulin (TUBB3), and tubulinyl-Tyr carboxypeptidase (VASH1). Using Uniprot.org, the canonical protein sequences are located and combined to construct a FASTA file for a single complex query. The following FASTA file illustrates the query format:
>TRY1_monomer
MNPLLILTFVAAALAAPFDDDDKIVGGYNCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAGHCYKSRIQVRLGEHNIEVLEGNEQFINAAKIIRHPQYDRKTLNNDIMLIKLSSRAVINARVSTISLPTAPPATGTKCLISGWGNTASSGADYPDELQCLDAPVLSQAKCEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVCNGQLQGVVSWGDGCAQKNKPGVYTKVYNYVKWIKNTIAANS
>TRY1_PABPN1_complex
MNPLLILTFVAAALAAPFDDDDKIVGGYNCEENSVPYQVSLNSGYHFCGGSLINEQWVVSAGHCYKSRIQVRLGEHNIEVLEGNEQFINAAKIIRHPQYDRKTLNNDIMLIKLSSRAVINARVSTISLPTAPPATGTKCLISGWGNTASSGADYPDELQCLDAPVLSQAKCEASYPGKITSNMFCVGFLEGGKDSCQGDSGGPVVCNGQLQGVVSWGDGCAQKNKPGVYTKVYNYVKWIKNTIAANS:MAAAAAAAAAAGAAGGRGSGPGRRRHLVPGAGGEAGEGAPGGAGDYGNGLESEELEPEELLLEPEPEPEPEEEPPRPRAPPGAPGPGPGSGAPGSQEEEEEPGLVEGDPGDGAIEDPELEAIKARVREMEEEAEKLKELQNEVEKQMNMSPPPGNAGPVIMSIEEKMEADARSIYVGNVDYGATAEELEAHFHGCGSVNRVTILCDKFSGHPKGFAYIEFSDKESVRTSLALDESLFRGRQIKVIPKRTNRPGISTTDRGFPRARYRARTTNYNSSRSRFYSGFNSRPRGRVYRGRARATSWYSPY
In this file, each query is described using two lines:
- > followed by a query name containing no spaces, e.g., TRY1_monomer.
- One or more amino acid sequences. Complex queries separate the individual chains with a colon (e.g., …ANS:MAA…).
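The two-line query format described above can be checked before submitting a job. The following is a minimal sketch, not ColabFold's own parser: `parse_queries` is a hypothetical helper that maps each query name to its list of chains and rejects names containing spaces.

```python
from typing import Dict, List


def parse_queries(fasta_text: str) -> Dict[str, List[str]]:
    """Map each query name to its chains; a ':' in the sequence separates chains."""
    sequences: Dict[str, str] = {}
    name = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            if not name or any(ch.isspace() for ch in name):
                raise ValueError(f"query name must be non-empty and contain no spaces: {line!r}")
            sequences[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Tolerate sequences wrapped over several lines.
            sequences[name] += line
    return {query: seq.split(":") for query, seq in sequences.items()}


# Toy input with shortened sequences, for illustration only.
example = ">toy_monomer\nMNPLL\n>toy_complex\nMNPLL:MAAAA\n"
for query, chains in parse_queries(example).items():
    print(query, len(chains), [len(chain) for chain in chains])
```

A monomer query yields a single chain, while a complex query yields one entry per colon-separated sequence.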
Using default settings, the FASTA file containing the queries for TUBA1A, TUBB3, and VASH1 is submitted to ColabFold: Search, which builds an MSA by searching the query sequences against the local MMseqs2 databases and stores the result in an alignment file (.a3m). This file is returned once the run is complete.
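To get a quick impression of the returned alignment, the .a3m file can be inspected with a few lines of Python. The sketch below relies on the general A3M convention that lowercase letters mark insertions relative to the query, so removing them leaves every row at the query length. Skipping lines that start with `#` is an assumption about how ColabFold annotates its alignment files; `read_a3m` is a hypothetical helper.

```python
from typing import List, Tuple


def read_a3m(a3m_text: str) -> List[Tuple[str, str]]:
    """Return (name, aligned_sequence) pairs with insertion columns removed."""
    entries: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for raw in a3m_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumed annotation line, not part of the alignment
        if line.startswith(">"):
            if name is not None:
                entries.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            # Lowercase residues are insertions relative to the query; drop them.
            chunks.append("".join(ch for ch in line if not ch.islower()))
    if name is not None:
        entries.append((name, "".join(chunks)))
    return entries


# Toy alignment, for illustration only.
toy_a3m = "#3\t1\n>query\nMNP\n>hit1\nM-P\n>hit2\nMaNP\n"
rows = read_a3m(toy_a3m)
print(len(rows))                     # number of sequences in the MSA: 3
print({len(seq) for _, seq in rows})  # all rows share the query length: {3}
```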
Next, ColabFold: Prediction is run with default settings, using the alignment file as input for structure inference. Upon completion, the output folder contains .pdb files of the predicted structures for each model used in the run (by default, 5), along with figures assessing the confidence of the predictions.
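Besides the confidence figures, a per-model summary can be computed directly from the .pdb files. The sketch below assumes the predicted structures store per-residue pLDDT in the B-factor column, as AlphaFold2 output does, and reads it from the fixed-width ATOM records of the C-alpha atoms. The helper names and the toy records are hypothetical.

```python
from statistics import mean
from typing import List


def ca_plddt(pdb_text: str) -> List[float]:
    """Return the B-factor field (columns 61-66) of every C-alpha ATOM record."""
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def toy_atom(serial: int, name: str, res: str, resseq: int, b: float) -> str:
    """Build a fixed-width ATOM record with dummy coordinates (illustration only)."""
    return (f"ATOM  {serial:5d} {name:<4} {res:>3} A{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b:6.2f}")


toy_pdb = "\n".join([
    toy_atom(1, " N", "MET", 1, 90.0),
    toy_atom(2, " CA", "MET", 1, 90.0),
    toy_atom(3, " CA", "ASN", 2, 80.0),
])
scores = ca_plddt(toy_pdb)
print(mean(scores))  # 85.0
```

Applying `ca_plddt` to each ranked model gives a simple way to compare them numerically; in practice the file contents would be read from the output folder first.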
Benchmarks¶
Database searching¶
ColabFold: Search
| Machine Type | Protein Monomer | Protein Complex |
|---|---|---|
| | ~50 min | ~40 min |
| | ~30 min | ~50 min |
| | ~30 min | ~35 min |
| | ~15 min | ~40 min |
| | ~30 min | ~40 min |
| | ~20 min | ~30 min |
Benchmarks were conducted using default settings.
Structure prediction¶
ColabFold: Prediction
| Machine Type | Protein Monomer | Protein Complex | Protein Complex |
|---|---|---|---|
| | ~2 min | ~20 min | >24 h |
| | ~2 min | ~6 min | ~5 h |
| | ~2 min | ~4 min | ~1 h 30 min |
Benchmarks were conducted using default settings and alignment/template files generated using ColabFold: Search.