Alphafold and Colabfold
Alphafold is a novel program for protein structure prediction that uses a neural network run on GPUs to produce protein structures whose accuracy is comparable to that of laborious manual structure simulations. Colabfold uses Alphafold, but replaces the time-consuming database searches with much faster, though less accurate, alternatives.
Alphafold 2
Alphafold consists of two major steps. The first is the genetic database search for the amino-acid sequence defined by the input fasta file; it runs completely on CPUs, is very I/O intensive, does not parallelize well, and does not use GPUs. The second step uses the pre-trained neural network, followed by a molecular dynamics relaxation, to produce the 3D protein structure in the form of a PDB file. This step uses GPUs for the neural network inference, and optionally for the molecular dynamics relaxation. The scarce GPU resource is therefore utilized by only a part of the workflow.
To make things worse, the database search step runs very slowly when the genetic databases are located on network mounted storage, which is the most commonly used storage on the CHPC clusters. We investigated the performance of the database search on all CHPC network file systems, and none provides acceptable results - a small protein sequence search took 8-16 hours to complete, depending on the file system used. The best alternative is to create a RAM disk on the node where the Alphafold simulation runs and copy the small databases and the indices of the large databases onto this RAM disk for fast access. This brings the database search in the aforementioned example down to 40 minutes.
We have created a script that sets up the RAM disk and copies the databases to it, located at /uufs/chpc.utah.edu/sys/installdir/alphafold/db_to_tmp_232.sh. Note that the databases on the RAM disk occupy ~25 GB, so request RAM for the job accordingly. The file copy is fairly fast, taking about a minute. Also make sure that the databases are removed at the end of the job. Below we provide shell commands and a SLURM script that do this.
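For reference, here is a minimal sketch of what such a staging script does - the actual list of files copied by db_to_tmp_232.sh may differ from this illustration:
export TMPDB=/tmp/$SLURM_JOBID
mkdir -p $TMPDB    # job-specific directory on the RAM disk backed /tmp
# stage the small databases and the index files of the large ones from the network scratch copy
cp -r /scratch/general/vast/app-repo/alphafold/pdb70 $TMPDB/
# ... additional databases / index files are copied the same way ...
# at the end of the job, free the memory by removing the copy
rm -rf $TMPDB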
Also, since some databases are on the RAM disk and some on the network mounted storage, Alphafold must be run with options that reflect the database locations. Because we package the Alphafold distribution in a Singularity container, the Alphafold launch command gets even more complicated. To make this simpler, we have created several wrapper scripts; both the full and the wrapped launch commands are shown in the examples below.
The second, neural network / MD part, in our example, takes 12 minutes on a 1080ti GPU, while it would take 3.5 hours on a CPU using 16 cores of the notchpeak-shared-short partition.
As we can see from the example timings above, out of the 52 minutes the job ran, the GPU was utilized for only 12 minutes. For this reason, we have modified the Alphafold source to run the CPU intensive and GPU intensive parts as separate jobs. The first job does the database search on CPUs only, using the protein sequence databases on the RAM disk. The second job runs the GPU intensive neural network part, which does not need many CPUs or the RAM disk.
Colabfold (see below) is a reasonable alternative; it uses a different database search engine that is less thorough but much faster.
Running Alphafold interactively
When learning to use Alphafold, or when setting up a new type of simulation, we recommend using the notchpeak-shared-short interactive queue, as that gives quicker turnaround if errors are encountered. Once that works, create SLURM scripts as shown in the next section, which allow running both the CPU and GPU parts via a single job submission.
First we submit an interactive job on the notchpeak cluster, asking for 16 CPUs and 128 GB of memory to run the database search. The database search is memory intensive, and we also need extra memory for the RAM disk databases to get better performance.
salloc -N 1 -n 16 --mem=128G -p notchpeak-shared-short -A notchpeak-shared-short -t 8:00:00
Then we load the Alphafold module and set up the databases on the RAM disk, i.e. in /tmp:
ml alphafold/2.3.2
/uufs/chpc.utah.edu/sys/installdir/alphafold/db_to_tmp_232.sh
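Before launching Alphafold, it can be useful to verify that the copy succeeded and to see how much space (i.e. RAM) it occupies, for example:
ls /tmp/$SLURM_JOBID        # the staged databases should be listed here
df -h /tmp                  # shows how much of the RAM backed /tmp is in use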
Now we are ready to run Alphafold. This can be done either with the run_alphafold.sh command, which requires explicit listing of the database location parameters, or with the run_alphafold_full.sh command, which already defines the database locations, so only the additional runtime parameters need to be listed. These include the user supplied parameters such as the FASTA input file name, which we define through the bash shell environment variable FASTA_FILE, and the output directory, defined by the OUTPUT_DIR variable:
export FASTA_FILE=ex1.fasta
export OUTPUT_DIR=my_out_dir
export SCRDB=/scratch/general/vast/app-repo/alphafold
export TMPDB=/tmp/$SLURM_JOBID
# run_alphafold.sh is an alias defined in the modulefile, requiring to list the appropriate databases
#run_alphafold.sh --data_dir=$SCRDB --uniref90_database_path=$SCRDB/uniref90/uniref90.fasta --uniref30_database_path=$TMPDB/UniRef30_2021_03 --mgnify_database_path=$SCRDB/mgnify/mgy_clusters_2022_05.fa --bfd_database_path=$TMPDB/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --pdb70_database_path=$TMPDB/pdb70 --template_mmcif_dir=$SCRDB/pdb_mmcif/mmcif_files --obsolete_pdbs_path=$SCRDB/pdb_mmcif/obsolete.dat --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01 --run_feature=1
# run_alphafold_full.sh is an alias that includes the list of the full databases in the argument list, so one only needs to provide the run specific runtime options
run_alphafold_full.sh --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01 --run_feature=1
Notice the option --run_feature=1, which tells the program to run only the database search and to save a file called features.pkl containing the database search results for each fasta file. Once this file is written, the first step is finished and we can end this job.
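A quick way to verify that the search completed is to check for the features.pkl file in the per-sequence output subdirectory (Alphafold creates one subdirectory per fasta target under the output directory), and then leave the interactive job:
ls $OUTPUT_DIR/*/features.pkl    # one features.pkl per input sequence
exit                             # ends the interactive salloc session and releases the CPUs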
For the reduced databases, which are useful for larger sequences or multimers, use the run_alphafold_red.sh command, which points to these reduced databases.
Second, we submit an interactive job on the notchpeak cluster asking for 4 CPUs and one GPU to run the GPU intensive part. This part does not need many CPUs and uses less memory, though with larger sequences one may need to ask for more than the 16 GB that is the default for 4 CPUs on notchpeak-shared-short.
salloc -N 1 -n 4 -p notchpeak-shared-short -A notchpeak-shared-short -t 8:00:00 --gres=gpu:1080ti
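Once this job starts, one can quickly confirm that a GPU was allocated:
nvidia-smi    # should list the allocated GPU (a GeForce 1080 Ti in this example)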
Run the second part of Alphafold as
export FASTA_FILE=ex1.fasta
export OUTPUT_DIR=my_out_dir
export SCRDB=/scratch/general/vast/app-repo/alphafold
export TMPDB=/scratch/general/vast/app-repo/alphafold
#run_alphafold.sh --data_dir=$SCRDB --uniref90_database_path=$SCRDB/uniref90/uniref90.fasta --uniref30_database_path=$TMPDB/UniRef30_2021_03 --mgnify_database_path=$SCRDB/mgnify/mgy_clusters_2022_05.fa --bfd_database_path=$TMPDB/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --pdb70_database_path=$TMPDB/pdb70 --template_mmcif_dir=$SCRDB/pdb_mmcif/mmcif_files --obsolete_pdbs_path=$SCRDB/pdb_mmcif/obsolete.dat --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01
# run_alphafold_full.sh is an alias that includes the list of the full databases in the argument list, so one only needs to provide the run specific runtime options
run_alphafold_full.sh --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01
Notice that we use --use_gpu_relax to run the molecular dynamics (MD) relaxation on a GPU. We have noticed that for small structures the CPU relaxation is faster, while for larger structures the GPU is faster. Since we ask for a smaller CPU count to run the GPU intensive part, we choose to run the MD on the GPU. To see all the runtime options, run run_alphafold.sh --help.
This launch command is for a monomer; for a multimer, some of the database parameters are different. Notice that we are also using the reduced databases:
/uufs/chpc.utah.edu/sys/installdir/alphafold/db_to_tmp_232_reduced.sh
run_alphafold_red.sh --fasta_paths=$FASTA_FILE --max_template_date=2022-06-27 --output_dir=$OUTPUT_DIR --use_gpu_relax --model_preset=multimer --db_preset=reduced_dbs --run_feature=1
Note that the command above does only the CPU intensive part; the GPU intensive part needs to be run as well, with
run_alphafold_red.sh --fasta_paths=$FASTA_FILE --max_template_date=2022-06-27 --output_dir=$OUTPUT_DIR --use_gpu_relax --model_preset=multimer --db_preset=reduced_dbs
Running Alphafold in a job script
We provide sample SLURM scripts that in essence perform the steps outlined above, at /uufs/chpc.utah.edu/sys/installdir/alphafold/run_alphafold_chpc_232.slr for the first step, and /uufs/chpc.utah.edu/sys/installdir/alphafold/run_alphafold_chpc_232_2.slr for the second step. Note the "232" at the end of the script names, which denotes the Alphafold version. Because of the explicit launch of Alphafold from the container, necessitated by the RAM disk databases, we need to explicitly call the appropriate container version.
The databases need about 25 GB of RAM on the RAM disk, so make sure to add this to the amount requested with the #SBATCH --mem option.
The first step SLURM script, run_alphafold_chpc_232.slr, then looks like:
#!/bin/bash
#SBATCH -t 8:00:00
#SBATCH -n 16
#SBATCH -N 1
#SBATCH -p notchpeak-shared-short
#SBATCH -A notchpeak-shared-short
#SBATCH --mem=128G
# this script runs the first, CPU intensive, part of AlphaFold
ml purge
ml alphafold/2.3.2
# put the name of the fasta file here
export FASTA_FILE="t1050.fasta"
export OUTPUT_DIR="out"
# copy some of the databases to the RAM disk
/uufs/chpc.utah.edu/sys/installdir/alphafold/db_to_tmp_232.sh
SCRDB=/scratch/general/vast/app-repo/alphafold
TMPDB=/tmp/$SLURM_JOBID
sbatch -d afterok:$SLURM_JOBID run_alphafold_chpc_232_2.slr
# run_alphafold.sh is an alias defined in the modulefile, requiring to list the appropriate databases
#run_alphafold.sh --data_dir=$SCRDB --uniref90_database_path=$SCRDB/uniref90/uniref90.fasta --uniref30_database_path=$TMPDB/UniRef30_2021_03 --mgnify_database_path=$SCRDB/mgnify/mgy_clusters_2022_05.fa --bfd_database_path=$TMPDB/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --pdb70_database_path=$TMPDB/pdb70 --template_mmcif_dir=$SCRDB/pdb_mmcif/mmcif_files --obsolete_pdbs_path=$SCRDB/pdb_mmcif/obsolete.dat --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01 --run_feature=1
# run_alphafold_full.sh is an alias that includes the list of the full databases in the argument list, so one only needs to provide the run specific runtime options
run_alphafold_full.sh --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01 --run_feature=1
rm -rf $TMPDB
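One possible refinement, not part of the distributed sample script, is to register the RAM disk cleanup with a bash trap right after the databases are copied, so the copy is removed even if the database search fails:
TMPDB=/tmp/$SLURM_JOBID
trap "rm -rf $TMPDB" EXIT    # runs when the script exits, whether or not the search succeeded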
Again, we are only asking for CPUs (as many as practical; the database search utilizes more CPUs only to a certain extent). We are also submitting the second step from this job with -d afterok:$SLURM_JOBID, which makes the second job start only after this first, CPU based step finishes successfully.
The second, GPU intensive step, run_alphafold_chpc_232_2.slr, runs on a few CPUs, needs less memory, and does not stage the databases on the RAM disk, since they are not used - their paths just need to be fed to the run_alphafold.sh command because it checks that these databases exist. Notice also that we are not setting the FASTA_FILE and OUTPUT_DIR environment variables; they are passed from the first job by default, since sbatch exports the submitting job's environment.
#!/bin/bash
#SBATCH -t 8:00:00
#SBATCH -n 4
#SBATCH -N 1
#SBATCH -p notchpeak-shared-short
#SBATCH -A notchpeak-shared-short
#SBATCH --gres=gpu:t4:1
#SBATCH --mem=32G
# this script runs the second, GPU intensive, part of AlphaFold
ml purge
ml alphafold/2.3.2
# FASTA_FILE and OUTPUT_DIR are brought from the previous job
# no use of databases so no need to create them in /tmp
SCRDB=/scratch/general/vast/app-repo/alphafold
TMPDB=/scratch/general/vast/app-repo/alphafold
# run_alphafold.sh is an alias defined in the modulefile, requiring to list the appropriate database
#run_alphafold.sh --data_dir=$SCRDB --uniref90_database_path=$SCRDB/uniref90/uniref90.fasta --uniref30_database_path=$TMPDB/UniRef30_2021_03 --mgnify_database_path=$SCRDB/mgnify/mgy_clusters_2022_05.fa --bfd_database_path=$TMPDB/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --pdb70_database_path=$TMPDB/pdb70 --template_mmcif_dir=$SCRDB/pdb_mmcif/mmcif_files --obsolete_pdbs_path=$SCRDB/pdb_mmcif/obsolete.dat --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01
# run_alphafold_full.sh is an alias that includes the list of the full databases in the argument list, so one only needs to provide the run specific runtime options
run_alphafold_full.sh --use_gpu_relax --fasta_paths=$FASTA_FILE --output_dir=$OUTPUT_DIR --max_template_date=2022-01-01
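If you prefer not to rely on the implicit propagation of FASTA_FILE and OUTPUT_DIR from the first job, they can also be passed explicitly on the submission line in the first script, for example:
sbatch -d afterok:$SLURM_JOBID --export=ALL,FASTA_FILE=$FASTA_FILE,OUTPUT_DIR=$OUTPUT_DIR run_alphafold_chpc_232_2.slr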
When running your own jobs, you may need to change the SLURM account (-A), partition (-p), memory (--mem) or GPU type (--gres=gpu), depending on the size of the job and the accounts, partitions and GPUs you have access to.
Once the two job scripts are ready, submit the first one with the sbatch command. The second job gets submitted automatically from the first job:
sbatch run_alphafold_chpc_232.slr
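After submission, the job dependency can be checked with squeue; the second job should be pending with reason (Dependency) until the first one completes:
squeue -u $USER    # lists both jobs; the second shows (Dependency) in the NODELIST(REASON) column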
For the multimer with reduced databases, example scripts are at /uufs/chpc.utah.edu/sys/installdir/alphafold/run_alphafold_chpc_multimer_232.slr and /uufs/chpc.utah.edu/sys/installdir/alphafold/run_alphafold_chpc_multimer_232_2.slr. Note mainly the different databases used in the command line, and that a different script, /uufs/chpc.utah.edu/sys/installdir/alphafold/db_to_tmp_232_reduced.sh, is called to copy the databases to the RAM disk.
Alphafold 3
Alphafold 3 has a substantially updated diffusion-based architecture that is capable of predicting the joint structure of complexes including proteins, nucleic acids, small molecules, ions and modified residues. This necessitates a different kind of input than the fasta input of Alphafold 2, described below.
IMPORTANT NOTE: Due to license restrictions on the Alphafold 3 weights, you need to familiarize yourself with the Alphafold 3 Model Parameters and Outputs terms of use and abide by them. In short, only non-profit activity is allowed, unethical use of the outputs is disallowed, and the Alphafold 3 paper must be cited in any publication. To gain access to Alphafold 3 at CHPC, request access to the weights by filling out this form. You will receive two e-mails. The first is an acknowledgement of receipt of the request form; the second, in a day or so, is the approval. Please forward this approval with your name and uNID to helpdesk@chpc.utah.edu. This will grant you access to Alphafold 3. You do not need to download the weights; they are part of the CHPC installation.
For the CPU and disk intensive MSA search, Alphafold 3 still uses jackhmmer and hmmsearch, the former requiring large databases that don't fit into the RAM disk - the approach we used with Alphafold 2. Fortunately, we now have the VAST file system, which is fairly performant over the InfiniBand network, so the CPU only MSA search is expected to take up to a few hours, depending on the sequence size. The performance of the GPU based inference part also depends on the sequence size, from minutes for shorter sequences up to several hours for longer ones.
As with Alphafold 2, in order to better utilize the in-demand GPU resources, we recommend splitting the calculation into two parts: the CPU only, I/O intensive MSA search, and the GPU intensive, inference based structure prediction. Alphafold 3 now has explicit options for that.
It is preferable to test the initial run interactively to make sure that the input file is correct and everything runs as it should; it is faster to iterate through the program launches in an interactive job.
Commands for the interactive session of the first, CPU/IO intensive MSA step, using the notchpeak cluster, are below:
salloc -N 1 -n 16 -A notchpeak-shared-short -p notchpeak-shared-short -t 8:00:00
ml purge
ml alphafold/3.0.0
run_alphafold.sh --json_path=af_input.json --output_dir=out --norun_inference
We first submit an interactive job in the notchpeak-shared-short partition. Use other CPU only partitions as you see fit. It is not advantageous to use more than 16 CPUs, as the jackhmmer MSA search is I/O bound, with the databases being on the network mounted VAST file system. Then we clean any modules we may have had loaded, load the alphafold 3 module and run. The paths to the databases and the Alphafold parameters are encoded in the run_alphafold.sh alias, therefore we only need to supply the input file name and output directory. The --norun_inference option ensures that only the MSA search is done.
The input file in JSON format that we use in the example above is:
{
"name": "2PV7",
"sequences": [
{
"protein": {
"id": ["A", "B"],
"sequence": "GMRESYANENQFGFKTINSDIHKIVIVGGYGKLGGLFARYLRASGYPISILDREDWAVAESILANADVVIVSVPINLTLETIERLKPYLTENMLLADLTSVKREPLAKMLEVHTGAVLGLHPMFGADIASMAKQVVVRCDGRFPERYEWLLEQIQIWGAKIYQTNATEHDHNMTYIQALRHFSTFANGLHLSKQPINLANLLALSSPIYRLELAMIGRLFAQDAELYADIIMDKSENLAVIETLKQTYDEALTFFENNDRQGFIDAFHKVRDWFGDYSEQFLKESRQLLQQANDLKQG"
}
}
],
"modelSeeds": [1],
"dialect": "alphafold3",
"version": 1
}
In this particular case, it's just a protein, named 2PV7.
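Alphafold 3 inputs can mix entity types. As an illustration only (see the Alphafold 3 input documentation for the authoritative format), an ATP ligand could be added as another entry in the sequences list:
{
  "ligand": {
    "id": ["C"],
    "ccdCodes": ["ATP"]
  }
}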
After the MSA step, the output directory contains a directory with the sequence name, out/2pv7. Inside this directory is a new json file that contains the MSA search results, in our case out/2pv7/2pv7_data.json. This file is used for the second step, which will be run on the GPU.
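Before requesting the GPU, it is worth confirming that this file exists:
ls -lh out/2pv7/2pv7_data.json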
salloc -n 4 -N 1 -p notchpeak-gpu -A notchpeak-gpu --gres=gpu:3090:1 --mem=32G -t 2:00:00
ml purge
ml alphafold/3.0.0
run_alphafold.sh --json_path=out/2pv7/2pv7_data.json --output_dir=out --norun_data_pipeline
Here we request a single GPU from the notchpeak-gpu partition. The Nvidia 3090 is a fairly old GPU, but it is sufficient for shorter sequences like this one. The Alphafold 3 documentation recommends using more performant GPUs such as the A100; these would be needed for larger sequences, but since we don't have many GPUs like that, we recommend reserving them for the calculations that need them. The --norun_data_pipeline option ensures that only the GPU based protein structure inference is run.
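When the inference finishes, the predicted structure and the associated confidence files appear under the same per-sequence output directory, which can be inspected with, e.g.:
ls out/2pv7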
Once one verifies that the inputs are correct and run well, move on to the SLURM batch scripts. Sample SLURM scripts are at /uufs/chpc.utah.edu/sys/installdir/alphafold/run_alphafold_chpc_300.slr for the first step, and /uufs/chpc.utah.edu/sys/installdir/alphafold/run_alphafold_chpc_300_2.slr for the second step. Note the "300" at the end of the script names, which denotes the Alphafold version. The first script automatically submits the second script with a dependency on the first script finishing, so all that needs to be done is to modify the sample scripts with the correct input file names and output directories and submit the first script: sbatch run_alphafold_chpc_300.slr.
For the MSA search the SLURM script looks like this:
#!/bin/bash
#SBATCH -t 8:00:00
#SBATCH -n 16
#SBATCH -N 1
#SBATCH -p notchpeak-shared-short
#SBATCH -A notchpeak-shared-short
# this script runs the first, CPU intensive, part of AlphaFold
ml purge
ml alphafold/3.0.0
# put the name of the input json file here
export INPUT_FILE="af_input.json"
export OUTPUT_DIR="out"
#submit the GPU inference job
sbatch -d afterok:$SLURM_JOBID run_alphafold_chpc_300_2.slr
# run_alphafold.sh is an alias that includes the list of the full databases in the argument list, so one only needs to provide the run specific runtime options
run_alphafold.sh --json_path=$INPUT_FILE --output_dir=$OUTPUT_DIR --norun_inference
and for the inference, it looks like this:
#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH -n 4
#SBATCH -N 1
#SBATCH -p notchpeak-gpu
#SBATCH -A notchpeak-gpu
#SBATCH --gres=gpu:3090:1
#SBATCH --mem=32G
# this script runs the second, GPU intensive, part of AlphaFold
ml purge
ml alphafold/3.0.0
# change this to point to the json file from the first step
# the output subdirectory name and data file are named after the "name" from the first step json file
# e.g. "name": "2PV7",
INPUT_FILE=out/2pv7/2pv7_data.json
OUTPUT_DIR=out
# no use of databases so no need to create them in /tmp
SCRDB=/scratch/general/vast/app-repo/alphafold3
# run_alphafold.sh is an alias that includes the list of the full databases in the argument list, so one only needs to provide the run specific runtime options
run_alphafold.sh --json_path=$INPUT_FILE --output_dir=$OUTPUT_DIR --norun_data_pipeline
# to run on older GPUs, add --flash_attention_implementation=xla
Colabfold
Colabfold is an adaptation of Alphafold designed to run in the Google Colab cloud, which includes a modified database search resulting in faster performance. Since running Jupyter notebooks on the Google Colab cloud infrastructure may be impractical for our users, we have set up an adaptation of Colabfold, called localcolabfold, which allows Colabfold to run locally, e.g. on an HPC cluster.
The database search is done on a shared remote server, which means that with increased usage this remote server may become a bottleneck. For that reason, please be judicious when submitting Colabfold jobs. If we reach a point of high use, we may need to look into setting up a dedicated local server for the database searches.
To run Colabfold, we load the module and run the command:
ml colabfold
export FASTA_FILE=ex1.fasta
export OUTPUT_DIR=my_output_dir
colabfold_batch --amber --templates --num-recycle 3 --use-gpu-relax $FASTA_FILE $OUTPUT_DIR
These commands can either be typed after one starts an interactive GPU job, as shown in the Alphafold interactive example above, or put into a SLURM script, replacing the Alphafold module and commands in the SLURM scripts shown above. The SLURM script would then look like this:
#!/bin/bash
#SBATCH -t 8:00:00
#SBATCH -n 16
#SBATCH -N 1
#SBATCH -p notchpeak-shared-short
#SBATCH -A notchpeak-shared-short
#SBATCH --gres=gpu:1080ti:1
ml colabfold
export FASTA_FILE=ex1.fasta
export OUTPUT_DIR=my_output_dir
colabfold_batch --amber --templates --num-recycle 3 --use-gpu-relax $FASTA_FILE $OUTPUT_DIR
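Assuming the script above is saved as run_colabfold.slr (a file name we pick here for illustration), it is submitted the usual way:
sbatch run_colabfold.slr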