The Genome Analysis Toolkit (GATK) is often used for variant discovery in high-throughput sequencing data. This post describes how to set up a GATK working environment on High Performance Computing (HPC) systems.
The post "GATK on local HPC infrastructure" documents several prerequisites that we need to install before installing GATK. For instance, Java is needed for GATK. Details on how to install Java can be found here.
Data Carpentry offers a genomics workshop, which could be useful to work through before starting all these installations.
The post "(How to) Install all software packages required to follow the GATK Best Practices" gives more detailed instructions on installing the software needed for the GATK variant calling best practices. This software includes BWA, SAMtools, Picard, IGV, R, etc.
Of note, GATK requires several R packages to be installed beforehand so that the plots in the GATK best practices can be generated successfully. By default, R installs user packages under the $HOME directory. To make sure that these R packages do not clog up the home directory, we need to set a different place to hold the libraries.
Here is the procedure to change the default directory where R packages are installed:
Use the usethis package to open .Rprofile (for example with usethis::edit_r_profile()).
In .Rprofile, store the current library paths in myPaths:
myPaths <- .libPaths() # get the paths
Then append the new library path (e.g. /project/PI/user/R_packages/) to myPaths as the last element of the vector.
myPaths <- c(myPaths, "/project/PI/user/R_packages/") # append the new path
After this, reorder the elements of the myPaths vector so that the last element comes first.
myPaths <- c(myPaths[length(myPaths)], myPaths[-length(myPaths)]) # move the last element to the front
Lastly, reassign the paths.
.libPaths(myPaths) # reassign them
We can now install R packages, which will go to the desired place that we just set up.
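The same setup can also be scripted from the shell, which is handy on a cluster where opening an R session just to edit the profile is inconvenient. The following is a minimal sketch, not the method from the posts above: setup_r_lib is a hypothetical helper name, and it prepends the new directory in a single .libPaths() call instead of using the myPaths steps.

```shell
# Sketch: point R at a custom package library by appending to an .Rprofile.
# setup_r_lib is a hypothetical helper name, not part of R or GATK.
setup_r_lib() {
    lib_dir="$1"                       # e.g. /project/PI/user/R_packages
    rprofile="${2:-$HOME/.Rprofile}"   # profile file to append to

    # The directory must exist: R only keeps existing directories in .libPaths()
    mkdir -p "$lib_dir" || return 1

    # Append R code that puts the custom library first in the search path
    cat >> "$rprofile" <<EOF
.libPaths(c("$lib_dir", .libPaths()))
EOF
}
```

Calling setup_r_lib /project/PI/user/R_packages and then starting a new R session should show the new directory as the first entry of .libPaths().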
What about Python libraries?
A great option mentioned by the GATK team is using miniconda to manage the environment, including GATK packages.
First, this post explains how to install Miniconda on HPC.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh ./Miniconda3-latest-Linux-x86_64.sh
Log out and back into the cluster and run which python to see if miniconda is installed successfully. Now you should see that we are in the conda base environment.
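The which python check can be wrapped in a small function so that it gives a clear yes or no answer. This is only a sketch: check_conda_python is a hypothetical helper, and matching on "conda" in the path assumes the installer's default prefix (a directory such as ~/miniconda3) was accepted.

```shell
# Sketch: report whether `python` resolves to a conda-managed interpreter.
# check_conda_python is a hypothetical helper; the "*conda*" match assumes the
# installer's default prefix (a directory such as ~/miniconda3).
check_conda_python() {
    # Use the path given as $1, or fall back to whatever `python` is on PATH
    py_path="${1:-$(command -v python)}"
    if [ -z "$py_path" ]; then
        echo "python not found on PATH"
        return 1
    fi
    case "$py_path" in
        *conda*) echo "conda python: $py_path" ;;
        *)       echo "non-conda python: $py_path"; return 1 ;;
    esac
}
```

Running check_conda_python with no argument after logging back in inspects the python currently on PATH.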
The post "(How to) Install and use Conda for GATK4" tells us how to use Miniconda to create a gatk environment.
After following this post, list the conda environments.
We should see two environments available right now on the HPC.
The active env is base right now; we can use source activate gatk to switch to the gatk env.
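As a sketch of this check, the listing can be piped through a small filter before activating. The command conda env list prints one environment per line, with comment lines starting with #. The helper name has_conda_env is hypothetical, and the environment name gatk is assumed to be the one created by following the post above.

```shell
# Sketch: succeed if a named environment appears in `conda env list` output.
# has_conda_env is a hypothetical helper; it reads the listing on stdin.
has_conda_env() {
    # Skip comment lines, then compare the first column with the wanted name
    awk -v name="$1" '!/^#/ && $1 == name { found = 1 } END { exit !found }'
}
```

A typical use would be conda env list | has_conda_env gatk && conda activate gatk. On recent conda releases, conda activate is the recommended form of source activate.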
Besides installing GATK from the tar.gz file, we can also install it through Singularity or Docker.