Converting BAM files to fastq

There are a lot of resources on the internet that describe how to convert bam files to fastq files for realignment, but many of them are using different methods and different tools. I have tested and evaluated many of these conversion tools and here are my findings.


Why is it important to choose the right tool?

To preface, converting a .bam file to a .fastq file may seem to be a trivial task since they both contain the same information, but the .bam file is aligned to a reference genome rather than just scattered reads. It turns out there are several edge cases that as a whole do not happen very often, but on a NGS scale can become abundant due to the amount of reads.

Therefore, although most tools that convert from bam to fastq are able to do what they do, there are slight differences in how these tools handle these edge cases that make some better than others.

Which is the best tool?

The tools I tested are samtools 1.9’s fastq, Picard 2.10.9’s SamToFastq, and biobambam 2.0.106. In my testing, I have found that samtools delivers the best results with the least amount of system resources and time.

Samtools bam to fastq workflow

Now that we have chosen which tool to use, we need to know how to properly use it. Samtools thankfully has very robust documentation and support, which makes it one of the easier packages to work with.

Official documentation

Converting the entire BAM file to fastq for realignment to a new reference genome

If you have an old bam file that was aligned to an older build of the reference genome, you may want to realign the reads to an updated reference. In the case of the human genome, this is a common problem that people have right now since chromosomal coordinates are completely different between the old GRCh37 and the new GRCh38 reference assemblies. This makes analysis much more difficult since, for example, the location of genes in GRCh37 are different in GRCh38.

To realign the bam file, you will need to convert to fastq and then run alignment again.

Step 1. Sorting by read name

This step is critical since the resulting paired-end fastq files need to be in pairs. You sort the bam file like this:

samtools sort -n -@ $(nproc) -o ${sorted_bam} ${original_bam}
-n  Sort by read names (i.e., the QNAME field) rather than by chromosomal coordinates.
-@  Set number of sorting and compression threads. By default, operation is single-threaded. $(nproc) will give the number of actual processors you have.
-o  Write the final sorted output to FILE, rather than to standard output.

Samtools sort documentation

Step 2. Converting to fastq

Now that the bam file is sorted, we can now convert it to a fastq.

samtools fastq -@ $(nproc) -1 ${fastq1} -2 ${fastq2} -s ${singleton} -0 ${ambiguous} ${sorted}
-1  Write reads with the READ1 FLAG set (and READ2 not set) to FILE instead of outputting them. If the -s option is used, only paired reads will be written to this file.
-2  Write reads with the READ2 FLAG set (and READ1 not set) to FILE instead of outputting them. If the -s option is used, only paired reads will be written to this file.
-s  Write singleton reads to FILE.
-0  Write reads where the READ1 and READ2 FLAG bits set are either both set or both unset to FILE instead of outputting them.

Samtools fastq documentation

Step 3. Align to new reference genome

Now that you have fresh fastq files, you are ready for alignment. There are many different alignment algorithms and programs designed for different purposes, so I will not cover that here.’

Converting a region of the bam file to fastq

If you have a bam file where you only want reads from specific regions, you can do that with samtools view. Note: run this command before you sort the bam file by read name otherwise you will have trouble finding the region.

samtools view -h -b ${original_bam} chr2:100-1000 > ${region_bam}
-h  Include the header in the output.
-b  Output in the BAM format.

Samtools view documentation