How To Make Ensembl’s Variant Effect Predictor (VEP) Command Line Run Faster

Variant Effect Predictor’s command line program may seem intimidating to beginners, but is actually not that bad. Keep reading to see what I have learned from my experience working with command line VEP.



What is VEP?

Variant Effect Predictor (VEP) is a widely used software tool from Ensembl that predicts the functional consequences of genetic variants on genes, transcripts, and proteins. It is a standard step in many next-generation sequencing (NGS) analysis pipelines, and an important tool to know for any bioinformatician or biologist working on cancer or any other genetic disease.

Reducing runtime

Here are some effective methods of reducing the runtime of command line VEP in a Linux environment.

Downloading cache

VEP gives you the option to download a cache to the computer you are running it from when you run INSTALL.pl during installation. Do this. Not only is a local cache mandatory for users working with private medical data, it also means VEP reads local files instead of querying Ensembl’s servers over the internet, which is much faster.

The cache for the human genome is fairly large at around 30 GB, but the caches for other species are much smaller: around 2.5 GB for mouse and less than 1 GB for rat. Also, if you don’t have enough space in your home directory (the default location), you can run the installer like this to specify a different location.

perl INSTALL.pl --CACHEDIR /path/to/install/cache
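If you would rather skip the installer's interactive prompts, something like the following sketch may help. It assumes you are in the ensembl-vep directory; the function name install_vep_cache and the example species, assembly, and path are placeholders of mine, not anything VEP provides.

```shell
# install_vep_cache SPECIES ASSEMBLY CACHE_DIR
# Hypothetical helper: runs the VEP installer non-interactively.
# Must be called from the directory that contains INSTALL.pl.
install_vep_cache() {
    # --AUTO c   : only download the cache, without prompting
    # --CACHEDIR : where the cache is written (instead of the home directory)
    perl INSTALL.pl --AUTO c --SPECIES "$1" --ASSEMBLY "$2" --CACHEDIR "$3"
}

# Example call (all three values are placeholders):
# install_vep_cache homo_sapiens GRCh38 /path/to/install/cache
```

Wrapping the call in a function just makes it easy to reuse the same cache directory for several species.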

When you run VEP, be sure to also specify this non-default location, together with --cache so that VEP actually uses the cache.

vep --cache --dir_cache /path/to/install/cache
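For private data you may also want to make sure VEP never connects to a remote database. A minimal sketch, assuming the cache was installed to the placeholder path below; the function name run_vep_offline and the CACHE_DIR variable are my own, while --cache, --offline, --dir_cache, -i and -o are regular VEP options.

```shell
# Placeholder location; use the directory you gave to INSTALL.pl --CACHEDIR.
CACHE_DIR="${CACHE_DIR:-/path/to/install/cache}"

# run_vep_offline INPUT OUTPUT
# Hypothetical wrapper that always runs VEP against the local cache.
run_vep_offline() {
    # --cache     : read annotation from the local cache
    # --offline   : disable database connections entirely
    # --dir_cache : where the cache lives
    vep --cache --offline --dir_cache "$CACHE_DIR" -i "$1" -o "$2"
}

# Example call (file names are placeholders):
# run_vep_offline input.vcf output.txt
```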

Multithreading

Multithreading sounds difficult but is actually extremely easy to do when you’re not writing the code yourself. VEP can process your input in parallel if you run it with the --fork option, which splits the work across several forked processes.

vep --fork $(nproc)

The $(nproc) part is a shell command substitution: it runs nproc and inserts the number of processors available on your computer. In my experience you can use all processors for VEP and still do other lighter tasks while it runs. Keep in mind that each fork is a separate process with its own memory, so if you are short on RAM or want to use fewer processors, simply replace $(nproc) with the actual number you want.
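If you want a middle ground between "all processors" and a hard-coded number, a small helper can cap the fork count. This is only a sketch: pick_forks is a name I made up, and the default cap of 4 is an assumption you should tune for your own machine.

```shell
# pick_forks [MAX]
# Hypothetical helper: prints the smaller of the CPU count and MAX.
# MAX defaults to 4, which is an assumed starting point, not a VEP setting.
# The body runs in a subshell so its variables do not leak into your session.
pick_forks() (
    max="${1:-4}"
    cpus="$(nproc)"
    if [ "$cpus" -lt "$max" ]; then
        echo "$cpus"
    else
        echo "$max"
    fi
)

# Example call:
# vep --fork "$(pick_forks 8)"
```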

Sort input files

Sorting input files will reduce the runtime of VEP significantly, since the program can then read the cached reference files sequentially instead of jumping around for each variant. To take advantage of this, your input file needs to be sorted by chromosome first, then by position. For the data lines of a VCF file (header lines starting with # have to stay at the top, so handle them separately), here is how to do that with the standard sort command.

sort -k1,1 -k2,2 -Vs /path/to/unsorted/vcf > /path/to/sorted/vcf
  • -k1,1 sorts by column 1 (chromosome) first
  • -k2,2 then sorts by column 2 (position)
  • -V tells it to do a natural sort, so chr2 comes before chr10
  • -s makes the sort stable: lines with identical keys keep their original order
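Real VCF files start with header lines, which a plain sort would shuffle in with the variants. One way to handle that, sketched here with GNU grep and sort, is to copy the header first and sort only the data lines. The function name sort_vcf is my own, and the sketch assumes an uncompressed, tab-separated VCF.

```shell
# sort_vcf INPUT OUTPUT
# Hypothetical helper: writes a sorted copy of an uncompressed VCF.
sort_vcf() {
    # Header lines (starting with '#') go first, in their original order.
    grep '^#' "$1" > "$2"
    # Data lines: split on tabs, natural sort on chromosome (column 1),
    # numeric sort on position (column 2), stable for identical keys.
    grep -v '^#' "$1" | sort -t "$(printf '\t')" -s -k1,1V -k2,2n >> "$2"
}

# Example call (paths are placeholders):
# sort_vcf /path/to/unsorted/vcf /path/to/sorted/vcf
```

Giving each key its own ordering flag (V for the chromosome, n for the position) keeps the intent of each column explicit.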

Converting cache

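You can convert the cache to a tabix-indexed version with the convert_cache.pl script supplied in the ensembl-vep repository. The indexed cache speeds up the lookup of known variants at the same positions as yours. See Ensembl’s documentation for more information.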

perl convert_cache.pl -species [species] -version [vep_version]

Further optimization

These were just some of the most effective optimizations I found. There are more ways of making VEP faster, though some of them take more time to set up. If you are interested, see Ensembl’s official documentation on optimizations.