What file formats does minimap2 accept and produce?

Minimap2 is a widely used sequence alignment tool in genomics, known for its speed and versatility across different types of sequencing data. To run an alignment, it needs input files that contain nucleotide sequences, either raw reads or assembled sequences. Knowing which formats Minimap2 can read, such as FASTA and FASTQ, is essential for proper usage. These formats let the tool process short reads, long reads, or entire assemblies, depending on the user’s research goals and data sources.

Equally important is the output Minimap2 generates after alignment. By default the tool writes PAF (Pairwise mApping Format); with the -a option it writes SAM (Sequence Alignment/Map), which can then be converted to BAM or CRAM for compressed storage. SAM, BAM, and CRAM are widely supported by downstream bioinformatics tools, enabling analyses such as variant calling or transcriptome quantification. Knowing which formats Minimap2 outputs helps users integrate it into larger genomics pipelines and workflows.

Input File Formats Accepted by Minimap2

Understanding Input File Formats in Minimap2

When working with sequence alignment tools like Minimap2, it’s essential to understand the types of file formats it can process as input. This ensures compatibility, improves performance, and helps you integrate Minimap2 efficiently into bioinformatics workflows. Minimap2 accepts a few key file formats, each serving specific purposes depending on whether you are mapping raw reads, aligning contigs, or performing genome-to-genome comparisons.

FASTA Format for Sequence Alignment in Minimap2

The FASTA file format, typically identified by extensions such as .fa or .fasta, is one of the most commonly used formats in genomics. This format contains plain nucleotide or protein sequences, each preceded by a header line that begins with a “greater than” symbol. In the context of Minimap2, FASTA files can be used as either the query or the reference input, offering flexibility for different alignment scenarios. Researchers often use FASTA format when aligning long-read data or working with assembled genomes and transcriptomes. Because the format lacks quality scores, it is best suited for data that has already been processed or assembled.
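The record layout described above can be illustrated with a minimal reader. This is only a sketch, assuming uncompressed text input; the function name read_fasta and the toy sequences are made up for illustration and are not part of Minimap2.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a {name: sequence} dictionary."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: keep the first word after ">" as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped, so append to the current record.
            records[name] += line
    return records


# Toy input (hypothetical data): two records, the first wrapped over two lines.
fasta_records = read_fasta([">chr1 toy reference", "ACGTAC", "GTTA", ">chr2", "GGCC"])
```

For a real file, the same function can be fed an open file handle, since a text file iterates line by line.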

FASTQ Format for High-Throughput Sequencing Input

The FASTQ file format, often seen with extensions like .fq or .fastq, is designed for storing raw sequence data along with quality scores. Each sequence entry in a FASTQ file consists of four lines: a header beginning with @, the raw sequence, a separator line beginning with +, and a quality string of the same length as the sequence. This makes FASTQ the usual format for high-throughput sequencing reads from technologies such as Illumina, PacBio, and Oxford Nanopore. Within Minimap2, FASTQ files are primarily used as the query input, especially in read mapping workflows. Minimap2 does not use the base qualities when computing alignments; when SAM output is requested, it carries them through to the output records, where downstream tools such as variant callers can use them.
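The four-line record structure can be sketched as a small reader. This assumes strict four-line records (no wrapped sequences) and Phred+33 quality encoding; read_fastq and phred33 are hypothetical helper names, and the read below is invented.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        # A truncated file yields empty strings here and fails validation below.
        seq = next(it, "").strip()
        plus = next(it, "").strip()
        qual = next(it, "").strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        name = header[1:].split()[0] if len(header) > 1 else ""
        yield name, seq, qual


def phred33(quality: str) -> List[int]:
    """Decode a quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality]


# Toy record (hypothetical data).
fastq_records = list(read_fastq(["@read1 extra\n", "ACGT\n", "+\n", "II5!\n"]))
scores = phred33(fastq_records[0][2])
```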

MMI Indexed Reference Files for Efficient Reuse

The .mmi file format in Minimap2 represents a pre-computed index of a reference genome or transcriptome. When Minimap2 is given a FASTA reference, it builds the index in memory for that run; to keep the index, the user saves it to an .mmi file with the -d option. Loading the saved index speeds up later alignment runs against the same reference, which is particularly useful in repetitive workflows where multiple datasets are aligned to the same genome. Indexing parameters such as the k-mer size and minimizer window are fixed when the index is built, so the index should be created with settings that match the intended mapping runs. This format is used only as the reference input, not the query.
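As a rough sketch of the build-once, reuse-many pattern, the helpers below assemble the two command lines as argument lists that could be passed to subprocess.run. The options -d (save index), -a (SAM output), and -x (preset) are documented Minimap2 options; the function names and the file names ref.fa, ref.mmi, and reads.fq are placeholders.

```python
from typing import List, Optional


def index_command(ref_fasta: str, index_path: str, preset: Optional[str] = None) -> List[str]:
    """Build the argv that saves a reference index to an .mmi file."""
    cmd = ["minimap2"]
    if preset:
        cmd += ["-x", preset]  # a preset also sets the indexing parameters
    cmd += ["-d", index_path, ref_fasta]
    return cmd


def map_command(index_path: str, reads: str, sam: bool = True,
                preset: Optional[str] = None) -> List[str]:
    """Build the argv that maps reads against a saved index."""
    cmd = ["minimap2"]
    if preset:
        cmd += ["-x", preset]  # use the same preset the index was built with
    if sam:
        cmd.append("-a")  # SAM output; without -a the output is PAF
    cmd += [index_path, reads]
    return cmd


build = index_command("ref.fa", "ref.mmi", preset="map-ont")
align = map_command("ref.mmi", "reads.fq", sam=True, preset="map-ont")
# To execute (assumes minimap2 is on PATH), alignments go to standard output:
#   import subprocess
#   subprocess.run(build, check=True)
#   with open("aln.sam", "w") as out:
#       subprocess.run(align, check=True, stdout=out)
```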

Conclusion: Choosing the Right Input Format for Minimap2

Selecting the appropriate input file format in Minimap2 depends on the nature of your sequencing data and the specific goals of your alignment task. Use FASTA for processed sequences, FASTQ for raw reads with quality scores, and MMI for fast, repeatable alignments to the same reference. Understanding these input formats helps streamline your analysis and ensures optimal performance from Minimap2.
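When a pipeline receives files of unknown type, a quick check of the first character can separate the two text formats. This is a simple heuristic sketch for uncompressed input, not a description of how Minimap2 itself detects formats; the function name is hypothetical.

```python
def sniff_sequence_format(first_line: str) -> str:
    """Guess the format of a sequence file from its first line."""
    if first_line.startswith(">"):
        return "fasta"  # FASTA headers begin with ">"
    if first_line.startswith("@"):
        return "fastq"  # FASTQ headers begin with "@"
    return "unknown"


guesses = [sniff_sequence_format(s) for s in (">chr1", "@read1", "ACGT")]
```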

Output File Formats Produced by Minimap2

Exploring Output File Formats in Minimap2

Minimap2 is a powerful and widely used sequence alignment tool, designed to handle a variety of sequencing technologies and alignment tasks. One of its core strengths lies in the flexibility of its output file formats. These formats enable integration into diverse bioinformatics pipelines, whether for read mapping, variant calling, or genome assembly. Understanding how Minimap2 structures its output is essential for efficient downstream analysis and tool compatibility.

SAM Format in Minimap2 for Detailed Alignment Representation

The Sequence Alignment/Map format, commonly referred to as SAM and saved with a .sam file extension, is the format Minimap2 produces when run with the -a option; without -a, the tool writes PAF. SAM stores comprehensive alignment information in a plain-text, human-readable layout. After an optional header section whose lines begin with @, each line corresponds to a single alignment record and includes fields such as the read name, flag, reference name, mapping position, mapping quality, and the CIGAR string, which encodes the alignment’s structure.
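The leading fields named above can be pulled out of a SAM line with a few lines of code. This is a minimal sketch that reads only the first six of the eleven mandatory columns and ignores optional tags; the class and function names are illustrative, and the sample line is invented.

```python
from typing import NamedTuple, Optional


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one SAM line; return None for header lines starting with '@'."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), fields[5])


def is_unmapped(record: SamRecord) -> bool:
    """Flag bit 0x4 marks a read as unmapped."""
    return bool(record.flag & 0x4)


# Toy lines (hypothetical data).
header = parse_sam_line("@SQ\tSN:chr1\tLN:1000")
record = parse_sam_line("read1\t0\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tIIII")
```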

Minimap2’s support for SAM output makes it highly compatible with other popular tools in the bioinformatics ecosystem. For example, the SAM file generated by Minimap2 can be directly piped into samtools, a suite of programs designed for manipulating alignments in the SAM, BAM, or CRAM formats. This compatibility allows users to quickly convert SAM files to compressed BAM files, perform sorting by coordinates or read names, index alignments for fast retrieval, and carry out various filtering steps required in variant calling workflows.

The comprehensive nature of the SAM format means it is particularly useful in applications that demand base-level accuracy and full alignment details, such as genome resequencing or transcriptome analysis. Although it can be larger in file size compared to more compact formats, its widespread adoption and readability make it a staple output choice for many researchers using Minimap2.

PAF Format in Minimap2 for Efficient and Scalable Alignment Output

Minimap2 writes the Pairwise mApping Format, known as PAF, by default when the -a option is not given. Unlike the more verbose SAM format, PAF is a lightweight, tab-delimited format designed for high-throughput alignment tasks where storage space and processing speed are critical. The PAF output is especially favored in scenarios involving long-read data, de novo genome assembly, or rapid genome-to-genome comparisons, where only summary alignment information is needed rather than per-base detail.

Each line in a PAF file represents a single mapping and contains at least twelve tab-separated fields: the query name, length, start, and end; the relative strand; the target name, length, start, and end; the number of matching bases; the alignment block length; and the mapping quality. Optional tags may follow these columns. This minimalistic structure allows PAF files to be easily parsed and rapidly processed, making them an efficient choice for large-scale projects. Bioinformatics tools such as miniasm, which assembles raw long reads into draft genomes, are specifically designed to work with PAF input, highlighting its utility in modern sequencing pipelines.
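A PAF line can be parsed by splitting on tabs and converting the numeric columns. The sketch below reads the twelve mandatory columns and derives an identity estimate as matching bases divided by alignment block length; when Minimap2 is run without base-level alignment, those two columns are approximate, so the ratio is only a rough guide. The names and the sample line are illustrative.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tlen: int
    tstart: int
    tend: int
    matches: int     # number of matching bases
    block_len: int   # alignment block length
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse the 12 mandatory PAF columns; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 columns")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                     int(f[6]), int(f[7]), int(f[8]), int(f[9]), int(f[10]), int(f[11]))


def block_identity(record: PafRecord) -> float:
    """Matching bases divided by alignment block length."""
    return record.matches / record.block_len if record.block_len else 0.0


# Toy line (hypothetical data) with one optional tag at the end.
paf = parse_paf_line("read1\t1000\t0\t1000\t+\tchr1\t50000\t2000\t3000\t950\t1000\t60\ttp:A:P")
identity = block_identity(paf)
```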

Because Minimap2 can produce PAF output without performing base-level alignment, it is well suited for workflows that require speed and scale over granularity. When per-base detail is needed in PAF, the -c option adds a CIGAR string to each line in the cg tag, at the cost of extra computation. Although default PAF lacks the rich information found in SAM, its simplicity and performance benefits have made it a go-to format for researchers working with high-throughput long-read data.

Choosing Between SAM and PAF in Minimap2 Workflows

The choice between SAM and PAF output formats in Minimap2 depends largely on the nature of the task and the level of detail required for downstream analysis. For workflows that demand high-resolution alignments and compatibility with tools like samtools or GATK, SAM offers the most complete and flexible solution. On the other hand, for rapid alignment tasks or applications focused on assembly and structural comparison, PAF provides a streamlined alternative that reduces both storage requirements and computational load.

Minimap2 empowers researchers by supporting both formats, ensuring that users can tailor their output to the needs of their specific project, whether it’s fine-grained variant analysis or high-throughput genome assembly. By understanding how these formats function and when to use each, users can fully leverage the capabilities of Minimap2 in modern genomics.

Conversion and Compatibility

Understanding Conversion and Compatibility of Minimap2 Output Formats

Minimap2 writes two output formats, PAF by default and SAM with the -a option, and both fit into a variety of bioinformatics pipelines. While both are informative and readable, they are not always optimal for storage or direct downstream processing. To improve performance, reduce file sizes, and ensure compatibility with standard genomic analysis tools, converting these outputs into other formats is a common step in post-alignment workflows.

Transforming SAM to BAM or CRAM for Enhanced Compression and Speed

The SAM file format, commonly produced by Minimap2, is a human-readable text representation of sequence alignments. Although useful for inspection and debugging, SAM files tend to be large and inefficient for long-term storage or high-throughput processing. To address this, researchers often convert SAM files into BAM or CRAM formats using a utility called samtools, which is one of the most trusted tools in the field of genomic data processing. BAM is a binary version of SAM that retains all alignment data while significantly reducing file size and increasing processing speed. CRAM goes even further by compressing data using reference-based algorithms, making it ideal for situations where storage efficiency is critical. Both BAM and CRAM formats are widely accepted across bioinformatics platforms, including variant callers, genome browsers, and alignment visualizers, making the conversion from SAM essential in scalable sequencing pipelines.
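The conversion steps can be sketched as samtools command lines assembled in code. The subcommands and options used here (sort -o, index, view -C -T) are standard samtools usage; the helper names and file names are placeholders, and the commands are only built, not executed.

```python
from typing import List


def sort_to_bam(sam_path: str, bam_path: str) -> List[str]:
    """Coordinate-sort a SAM file and write the result to bam_path."""
    return ["samtools", "sort", "-o", bam_path, sam_path]


def index_bam(bam_path: str) -> List[str]:
    """Index a coordinate-sorted BAM file for random access."""
    return ["samtools", "index", bam_path]


def bam_to_cram(bam_path: str, ref_fasta: str, cram_path: str) -> List[str]:
    """Convert BAM to CRAM; CRAM compression needs the reference FASTA (-T)."""
    return ["samtools", "view", "-C", "-T", ref_fasta, "-o", cram_path, bam_path]


steps = [
    sort_to_bam("aln.sam", "aln.sorted.bam"),
    index_bam("aln.sorted.bam"),
    bam_to_cram("aln.sorted.bam", "ref.fa", "aln.cram"),
]
# To execute (assumes samtools is on PATH):
#   import subprocess
#   for cmd in steps:
#       subprocess.run(cmd, check=True)
```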

Converting PAF Output for Specialized Applications Using paftools.js and Custom Parsers

When Minimap2 is run without the -a option, it generates output in the PAF (Pairwise mApping Format), which is a simplified, tab-delimited format designed for rapid mapping summaries. While PAF is ideal for scenarios where base-level alignment details are unnecessary, such as preliminary genome comparisons or large-scale scaffolding, some situations require a different representation or compatibility with visualization tools. In these cases, the PAF output needs to be converted or post-processed. One option is paftools.js, a companion script distributed with Minimap2 and run with the k8 JavaScript shell, which offers subcommands for tasks such as converting between alignment formats and calling variants from assembly-to-reference alignments. Additionally, bioinformaticians often write custom parsers in Python, Perl, or R to adapt PAF output to the input requirements of their analytical tools. This flexibility keeps PAF-based workflows adaptable, especially in complex assembly and annotation pipelines.
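A custom parser of the kind mentioned above can be very small. The sketch below turns one PAF line into a six-column BED line covering the target interval, so mappings can be loaded as a track in a genome browser. It is an illustration, not a replacement for paftools.js; the function name and sample line are invented.

```python
def paf_to_bed(line: str) -> str:
    """Convert one PAF line to a BED6 line on the target sequence."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 columns")
    qname, strand, tname = f[0], f[4], f[5]
    # PAF coordinates are 0-based with an open end, like BED,
    # so the target start and end carry over unchanged.
    tstart, tend = int(f[7]), int(f[8])
    # The BED score column is used here to hold the mapping quality.
    score = int(f[11])
    return "\t".join([tname, str(tstart), str(tend), qname, str(score), strand])


bed_line = paf_to_bed("read1\t1000\t0\t1000\t-\tchr1\t50000\t2000\t3000\t950\t1000\t60")
```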

Seamless Integration of Minimap2 Outputs into Bioinformatics Pipelines

One of the key strengths of Minimap2 lies in its compatibility with a broad ecosystem of downstream bioinformatics tools. Alignments stored as SAM, BAM, or CRAM can be imported into workflows for variant calling, structural variation analysis, and transcript analysis, since tools like GATK, FreeBayes, and bcftools work with these standard formats. PAF output feeds assembly and comparison tools such as miniasm. Moreover, visualization software such as IGV (Integrative Genomics Viewer) and JBrowse support direct loading of BAM and CRAM files for intuitive examination of alignment results. The standardization of these formats ensures smooth transitions between tools, enhances reproducibility, and limits file conversion to routine steps such as sorting, compression, and indexing.

Conclusion

Minimap2 is a highly versatile alignment tool that supports a range of widely used input formats, including FASTA for processed sequences, FASTQ for raw reads with quality scores, and MMI for pre-indexed reference genomes. This flexibility allows researchers to tailor their workflows to the data type and sequencing platform being used. By accepting both single-read and large genome files, Minimap2 accommodates diverse applications, from genome assembly and transcriptomics to read mapping and structural variation analysis, across both short- and long-read technologies.

In terms of output, Minimap2 can generate SAM and PAF files, which serve different roles in downstream processing. SAM files offer detailed, line-by-line alignment records and are ideal for further analysis and conversion into compressed formats like BAM and CRAM using samtools. PAF files provide a lightweight, efficient summary suitable for genome scaffolding and comparative genomics. Together, these output options ensure that Minimap2 integrates seamlessly with common bioinformatics pipelines, enabling scalable, high-performance sequence analysis across various research domains.