Set up GATK Working Environment on High Performance Computing Systems

Keren Xu

2020/09/24

The Genome Analysis Toolkit (GATK) is widely used for variant discovery in high-throughput sequencing data. This post describes how to set up a GATK working environment on a High Performance Computing (HPC) system.

The post "GATK on local HPC infrastructure" documents several prerequisites that must be installed before GATK. For instance, GATK requires Java. Details on how to install Java can be found here.

Data Carpentry offers a genomics workshop, which is worth working through before starting these installations.

The post "(How to) Install all software packages required to follow the GATK Best Practices" gives more detailed instructions on installing the software needed for the GATK variant calling best practices. This software includes BWA, SAMtools, Picard, IGV, and R, among others.

Of note, GATK requires several R packages to be installed beforehand so that the plots in the GATK best practices can be generated successfully. By default, R usually installs user packages somewhere under the $HOME directory. To make sure that these R packages do not clog up the home directory, we need to set a different place to hold the libraries.

Here are the steps to change the default directory for installing R packages:

First, use usethis to open .Rprofile

usethis::edit_r_profile()

In .Rprofile, store the current library paths in myPaths

myPaths <- .libPaths()   # get the paths

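Then add the new library path (e.g. /project/PI/user/R_packages/) as the last element of the myPaths vector. The directory must already exist, because .libPaths() silently drops paths that do not.

myPaths <- c(myPaths, "/project/PI/user/R_packages/")  # append the new path

After this, reorder the elements of the myPaths vector so that the last element comes first.

myPaths <- c(myPaths[length(myPaths)], myPaths[-length(myPaths)])  # move the new path to the front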

Lastly, reassign the paths.

.libPaths(myPaths)  # reassign them 

After restarting R, newly installed R packages will go to the library that we just set up.
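The same .Rprofile change can also be scripted from the shell, which can be handy on a cluster where opening an interactive R session is inconvenient. This is only a minimal sketch: the helper name add_r_lib_path is hypothetical, and the path in the example call is the placeholder used above.

```shell
# Hypothetical helper: prepend a project-space library to R's library paths.
# $1 = library directory, $2 = R profile file (defaults to ~/.Rprofile)
add_r_lib_path() {
  lib_dir="$1"
  rprofile="${2:-$HOME/.Rprofile}"
  # R ignores library paths that do not exist, so create the directory first
  mkdir -p "$lib_dir"
  # Append one line that puts the new directory ahead of the existing paths
  printf '.libPaths(c("%s", .libPaths()))\n' "$lib_dir" >> "$rprofile"
}

# Example call (placeholder path):
# add_r_lib_path /project/PI/user/R_packages
```

Running .libPaths() in a fresh R session afterwards should list the new directory first.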

What about Python libraries?

One option recommended by the GATK team is to use Miniconda to manage the environment, including the packages GATK needs.

First, this post explains how to install Miniconda on an HPC cluster.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh ./Miniconda3-latest-Linux-x86_64.sh 
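If the home directory has a small quota, it may be worth running the installer without prompts and pointing it at project space instead. The sketch below assumes the installer's -b (batch) and -p (prefix) options; the helper name install_miniconda and the prefix in the example call are hypothetical.

```shell
# Hypothetical helper: run the Miniconda installer without prompts.
# $1 = path to the downloaded installer script
# $2 = install prefix (assumption: a directory in project space, not $HOME)
install_miniconda() {
  installer="$1"
  prefix="$2"
  # -b: batch mode, no questions asked (the license terms are taken as agreed)
  # -p: directory to install into
  sh "$installer" -b -p "$prefix"
}

# Example call (placeholder prefix):
# install_miniconda ./Miniconda3-latest-Linux-x86_64.sh /project/PI/user/miniconda3
# /project/PI/user/miniconda3/bin/conda init   # batch mode leaves shell startup files untouched
```

In batch mode the installer does not edit the shell startup files, so the conda init step in the last comment is what makes the base environment show up at the next login.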

Log out and back in to the cluster, then run which python to check that Miniconda was installed successfully; the path should point into the Miniconda directory. The prompt should now also show that we are in the conda base environment.

The post "(How to) Install and use Conda for GATK4" explains how to use Miniconda to create a gatk environment.
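As a rough sketch of that step, the environment can be built from the gatkcondaenv.yml file that ships inside the unpacked GATK release. The helper name create_gatk_env is hypothetical, and the location of the unpacked release is an assumption to be replaced with the real path.

```shell
# Hypothetical helper: build the "gatk" conda environment from a GATK release.
# $1 = directory of the unpacked GATK release (assumed to contain gatkcondaenv.yml)
create_gatk_env() {
  gatk_dir="$1"
  env_file="$gatk_dir/gatkcondaenv.yml"
  # Stop early with a clear message if the environment file is missing
  if [ ! -f "$env_file" ]; then
    echo "environment file not found: $env_file" >&2
    return 1
  fi
  # Create an environment named "gatk" from the bundled specification
  conda env create -n gatk -f "$env_file"
}

# Example call (placeholder path):
# create_gatk_env /project/PI/user/gatk
```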

After following that post, run

conda-env list

We should now see two environments available on the HPC: base and gatk.

The active environment is currently base; we can use source activate gatk to switch to the gatk environment (with conda 4.4 or later, conda activate gatk does the same).

Besides installing GATK from the tar.gz file, we can also run it through Singularity or Docker.
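For the container route, a minimal sketch might look like the following. It assumes the broadinstitute/gatk image on Docker Hub; the helper name pull_gatk_image is hypothetical. Singularity is tried first because HPC clusters often provide it instead of Docker.

```shell
# Hypothetical helper: fetch the GATK image with whichever runtime is available.
# Assumption: the image is published as broadinstitute/gatk on Docker Hub.
pull_gatk_image() {
  image="broadinstitute/gatk"
  if command -v singularity >/dev/null 2>&1; then
    # Singularity converts the Docker image into a local image file
    singularity pull "docker://$image"
  elif command -v docker >/dev/null 2>&1; then
    docker pull "$image"
  else
    echo "neither singularity nor docker found on PATH" >&2
    return 1
  fi
}

# Example call:
# pull_gatk_image
```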