Variant Effect Predictor’s command line program may seem intimidating to beginners, but is actually not that bad. Keep reading to see what I have learned from my experience working with command line VEP.
What is VEP?
Ensembl's Variant Effect Predictor (VEP) is a standard software tool in many next-generation sequencing (NGS) analysis pipelines. It predicts the functional effect of genetic variants on genes, transcripts, and protein sequence. It is an important tool to know for any bioinformatician or biologist working on cancer or any other genetic disease.
Here are some effective methods of reducing the runtime of command line VEP in a Linux environment.
When you run INSTALL.pl during installation, VEP gives you the option to download a cache file to the computer you are running it from. Do this. Not only is a local cache often a requirement for users working with private medical data, it also means VEP can read local files instead of trying to access Ensembl’s files over the internet, which is much faster.
The cache for the human genome is fairly large at around 30 GB, but the caches for other species are much smaller: around 2.5 GB for mouse and less than 1 GB for rat. If you don’t have enough space in your home directory (the default location is ~/.vep), you can run the installer like this to specify a different location.
perl INSTALL.pl --CACHEDIR /path/to/install/cache
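Before downloading, it can help to check how much room the target filesystem actually has. Here is a minimal sketch of that check. The free_gb helper, the CACHE_DIR variable, and the 30 GB threshold (taken from the human cache size quoted above) are my own illustrative choices, not part of VEP.

```shell
# Print free space, in whole GB, on the filesystem holding the given directory.
# df -P gives one line per filesystem; column 4 is the available 1024-byte blocks.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Hypothetical values: where the cache would go and roughly how much it needs.
CACHE_DIR="${CACHE_DIR:-$HOME}"
NEEDED_GB=30

if [ "$(free_gb "$CACHE_DIR")" -lt "$NEEDED_GB" ]; then
    echo "Less than ${NEEDED_GB} GB free in $CACHE_DIR; consider another --CACHEDIR" >&2
fi
```

If the warning prints, point --CACHEDIR at a disk with more room.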
When you run VEP, be sure to also specify this non-default location.
vep --dir_cache /path/to/cache
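A mistyped cache path is an easy mistake to make, so a small guard before a long run can save some time. This is only a sketch. check_cache_dir is a hypothetical helper of mine, and the input and output file names in the usage comment are placeholders. The --cache and --offline flags tell VEP to use the cache and to avoid network connections.

```shell
# Succeed only if the given cache directory exists; otherwise explain why not.
check_cache_dir() {
    if [ -d "$1" ]; then
        return 0
    fi
    echo "VEP cache directory not found: $1" >&2
    return 1
}

# Example use (placeholder paths and file names):
# check_cache_dir /path/to/cache &&
#     vep --cache --offline --dir_cache /path/to/cache -i input.vcf -o output.txt
```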
Multithreading sounds difficult but is actually extremely easy to do when you’re not writing the code yourself. VEP can spread its work across multiple processes if you just run it with the --fork flag:
vep --fork $(nproc)
$(nproc) is bash command substitution: it runs nproc, which prints the number of processors available on your computer. Modern CPUs are generally powerful enough for you to use all processors for VEP and still do other, lighter tasks while it runs, so you usually don’t need to worry about that. If you want to use fewer processors, you can simply replace $(nproc) with the actual number you want.
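If you would rather leave some headroom without hard-coding a number, the count can be computed instead. A minimal sketch follows. fork_count is a hypothetical helper of mine, and reserving one core by default is just an assumption I picked.

```shell
# Print the number of forks to use: all processors minus a reserve, at least 1.
# $1 is the number of processors to leave free (defaults to 1).
fork_count() {
    reserve="${1:-1}"
    total="$(nproc)"
    n=$(( total - reserve ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Example use:
# vep --fork "$(fork_count)"
```

On a single-core machine this still gives 1, so the resulting command stays valid.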
Sort input files
Sorting input files will reduce the runtime of VEP significantly, since the program can then read the cached reference files sequentially rather than jumping around for one variant at a time. To take advantage of this, your input file needs to be sorted by chromosome first, then by position. For a VCF file, here is how to do that with the built-in sort command:
sort -k1,1 -k2,2 -Vs /path/to/unsorted/vcf
-k1,1 sorts first by column 1 only (the chromosome)
-k2,2 then sorts by column 2 only (the position)
-V tells it to do a natural sort, so chr2 comes before chr10
-s makes the sort stable: lines with identical keys keep their original order
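One caveat with sorting a VCF this way: the header lines starting with # get sorted along with the variant records, which can move them out of position. Here is a minimal sketch that passes the header through untouched and sorts only the records. It assumes GNU sort (for the V key option) and an uncompressed VCF. sort_vcf is a name I made up for illustration.

```shell
# Sort a VCF by chromosome, then position, leaving header lines in place.
# Assumes GNU sort and an uncompressed, tab-delimited VCF passed as $1.
sort_vcf() {
    tab="$(printf '\t')"
    # Header lines start with '#'; print them first, in their original order.
    grep '^#' "$1"
    # Records: natural sort on column 1, numeric sort on column 2, stable ties.
    grep -v '^#' "$1" | sort -t "$tab" -s -k1,1V -k2,2n
}

# Example use:
# sort_vcf /path/to/unsorted/vcf > /path/to/sorted/vcf
```

The V and n suffixes apply natural ordering to the chromosome column and numeric ordering to the position column.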
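You can convert the cache to a tabix-indexed version with the convert_cache.pl script supplied in the ensembl-vep repository. With an indexed cache, VEP can jump straight to the known variants at a given position instead of scanning through the cache files.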
perl convert_cache.pl -species [species] -version [vep_version]
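As far as I know, the conversion relies on the bgzip and tabix programs being on your PATH, so it is worth checking for them before you start. Treat that requirement as my assumption and confirm it against Ensembl’s documentation. The sketch below wraps the command above in that check. require_tools and convert_vep_cache are hypothetical helper names.

```shell
# Return 0 only if every named program is on PATH; report any that are missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing required tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Run the conversion for a species ($1) and cache version ($2).
# Run this from the directory that contains convert_cache.pl.
convert_vep_cache() {
    require_tools perl tabix bgzip || return 1
    perl convert_cache.pl -species "$1" -version "$2"
}

# Example use (illustrative values only):
# convert_vep_cache homo_sapiens 110
```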
These are just the most effective optimizations I have found. There are more ways of making VEP faster, though some take longer to set up. If you are interested, they are covered in Ensembl’s official documentation on optimizations: Ensembl’s Official Documentation