Introduction
FASTQ file
FASTQ is a common file format used in bioinformatics to represent DNA sequence data along with corresponding quality scores for each nucleotide. The quality scores indicate the reliability of the base call at a particular position in the sequence.
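To make the record layout concrete, here is a minimal sketch of reading FASTQ records and decoding their quality scores. It assumes the common four-lines-per-record layout and the Phred+33 quality encoding; neither is stated above, and older files may use a different offset. The function names are illustrative only.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ text stream.

    Assumes each record spans exactly four lines:
    '@name', sequence, '+' separator, quality string.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record")
        yield header[1:].rstrip("\n"), seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in qual]


def error_probability(score):
    """Estimated probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


example = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(io.StringIO(example)))
for name, seq, qual in records:
    print(name, seq, phred_scores(qual))
```

A Phred score of 20 corresponds to an estimated error probability of 1 in 100, and a score of 40 to 1 in 10,000.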
FASTQ files can become quite large, especially when dealing with high-throughput sequencing data. For that reason, compression is often applied to reduce storage requirements and speed up data transfer.
Two types of compression tools can be distinguished:
- general purpose compression tools, such as gzip or bzip2;
- special purpose compression tools designed specifically for .fastq files, such as SPRING or CoLoRd.
General purpose compression tools
For historical reasons and because of their simplicity, the most popular way to compress FASTQ files is still to use general purpose tools such as gzip, xz or 7zip. Moreover, most bioinformatics tools accept gzipped files as input.
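As an illustration of how little effort gzipped input requires, the following sketch writes a small gzipped FASTQ file and reads it back with Python's standard gzip module. The file name, the sample reads and the count_reads helper are made up for the example, and the four-lines-per-record layout is an assumption.

```python
import gzip
import os
import tempfile


def count_reads(path):
    """Count records in a plain or gzipped FASTQ file (four lines per record assumed)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


# Two made-up reads, used only for this demonstration.
sample = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFF\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fastq.gz")
    with gzip.open(path, "wt") as out:
        out.write(sample)
    n_reads = count_reads(path)

print(n_reads)
```

Opening the file in text mode ("rt") lets the rest of the code treat compressed and uncompressed input identically.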
However, general purpose tools do not exploit the characteristics specific to .fastq files, which makes them a good but not optimal solution. That is why specialized tools appeared; they are more efficient in terms of compression ratio and speed. For example, in some cases a specialized tool can compress 10 times faster and achieve a 3 times better compression ratio.
FASTQ specific compression tools
There are several dozen specialized compression tools available today. Some of them are open source and free to use, while others have a proprietary licence and no available source code. Such tools are usually far more efficient than general purpose ones. However, depending on the specific use case, a given solution might work better, worse, or not at all.
For example, some tools might be efficient for short reads (produced by NGS sequencers such as Illumina), but might not work well for files containing long reads, which are produced by third-generation sequencing (TGS) machines such as Oxford Nanopore or PacBio HiFi.
Another trade-off is speed versus compression ratio. One might want to prioritize one over the other depending on the specific use case, which could lead to choosing different tools for different purposes.
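One rough way to see the speed versus ratio trade-off is to time several compressors on the same data. The sketch below does this for the general purpose codecs in Python's standard library only; it does not run any FASTQ-specific tool. The input is synthetic random reads generated by a hypothetical helper, so the numbers will differ from those obtained on real sequencing data.

```python
import bz2
import gzip
import lzma
import random
import time


def synthetic_fastq(n_reads=500, read_len=100, seed=0):
    """Build made-up FASTQ content as bytes (random bases, skewed quality characters)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(rng.choice("FFFF:,") for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def benchmark(data):
    """Return {codec name: (compression ratio, seconds)} using default settings."""
    codecs = (("gzip", gzip.compress), ("bzip2", bz2.compress), ("xz", lzma.compress))
    results = {}
    for name, compress in codecs:
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(data) / len(packed), elapsed)
    return results


data = synthetic_fastq()
results = benchmark(data)
for name, (ratio, seconds) in results.items():
    print(f"{name}: ratio {ratio:.2f}, {seconds * 1000:.1f} ms")
```

A single timed run like this is noisy; for a meaningful comparison, repeat the measurement and use representative input files.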
Price is also an important parameter to consider. Some users might be fine with a free open source solution, while others would be ready to pay for better quality and available support.
Last but not least is the availability of the source code and its licence. Open source code allows you to check its safety and modify it if needed.