Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarizes results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Quality Control - Raw read QC and trimming, FastQC and fastp reports for input data assessment
- Alignment - Read alignment to reference genome (BAM/CRAM)
- Preprocessing - GATK preprocessing steps following best practices
- Variant Calling - Somatic variant detection, VCF files from multiple callers (Mutect2, Strelka2, SAGE)
- Annotation - Variant annotation with VEP
- Consensus - Consensus variant calling across multiple callers, in MAF format
- Filtering - Variant filtering and quality control in MAF format
- Realignment - Optional RNA-specific realignment
- MultiQC - Aggregate report describing results and QC
- Pipeline information - Report metrics generated during the workflow execution
Quality Control
FastQC
Output files
fastqc/
- *_fastqc.html: FastQC report containing quality metrics.
- *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.
NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of low quality.
fastp
Output files
fastp/
- *.fastp.html: fastp report in HTML format.
- *.fastp.json: fastp report in JSON format.
- *.fastp.log: fastp log file.
- *.fastp.fastq.gz: Trimmed FastQ files (if trimming is enabled).
fastp is a tool designed to provide fast, all-in-one preprocessing for FastQ files. It is written in C++ and supports multithreading for high performance. In this pipeline, fastp is used for quality control and adapter trimming when --trim_fastq is specified.
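As a rough illustration of how the JSON report can be used downstream, the sketch below computes the fraction of reads that survive trimming. The function name summarize_fastp and the example path are hypothetical; the summary, before_filtering, after_filtering and total_reads keys follow the layout of fastp's JSON report, which may differ between fastp versions.

```python
import json


def summarize_fastp(report):
    """Return read counts before/after filtering and the fraction retained.

    `report` is the parsed content of a *.fastp.json file.
    """
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    return {
        "reads_before": before,
        "reads_after": after,
        # Guard against empty inputs to avoid division by zero.
        "fraction_retained": after / before if before else 0.0,
    }


# Usage (the path is a hypothetical example):
# with open("fastp/sample1.fastp.json") as handle:
#     print(summarize_fastp(json.load(handle)))
```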
Alignment
BWA-MEM / BWA-MEM2 / STAR
Output files
alignment/
- *.bam: Aligned reads in BAM format.
- *.bai: BAM index files.
- *.cram: Aligned reads in CRAM format (if --save_output_as_bam is false).
- *.crai: CRAM index files.
The pipeline supports multiple aligners:
- BWA-MEM for DNA alignment
- BWA-MEM2 for faster DNA alignment
- STAR for RNA alignment
- DRAGMAP for high-accuracy alignment
Alignment files are saved in BAM or CRAM format depending on the --save_output_as_bam parameter.
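A minimal sketch of locating the alignment files for downstream scripts is shown here. The helper find_alignments is hypothetical and not part of the pipeline; it only assumes the alignment/ directory and the BAM/CRAM choice described above.

```python
from pathlib import Path


def find_alignments(results_dir, save_output_as_bam):
    """List alignment files under <results_dir>/alignment.

    Hypothetical helper: expects BAM files when save_output_as_bam is
    true and CRAM files otherwise. Index files are not returned.
    """
    suffix = ".bam" if save_output_as_bam else ".cram"
    alignment_dir = Path(results_dir) / "alignment"
    return sorted(p for p in alignment_dir.rglob(f"*{suffix}") if p.is_file())
```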
Alignment QC
Output files
alignment/samtools_stats/
- *.stats: Samtools stats output with alignment statistics.
alignment/mosdepth/
- *.mosdepth.global.dist.txt: Global coverage distribution.
- *.mosdepth.summary.txt: Coverage summary statistics.
Quality control metrics for aligned reads are generated using samtools stats and mosdepth.
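Both outputs are plain tab-delimited text, so they can be read without extra dependencies. The sketch below is illustrative only: the function names are hypothetical, and it assumes the usual layouts of these tools (summary rows prefixed with SN in samtools stats, and a chrom/length/bases/mean/min/max header in the mosdepth summary).

```python
import csv


def parse_samtools_sn(lines):
    """Collect the 'SN' (summary numbers) rows of samtools stats output."""
    summary = {}
    for line in lines:
        if not line.startswith("SN\t"):
            continue  # skip comments and other sections
        fields = line.rstrip("\n").split("\t")
        summary[fields[1].rstrip(":")] = fields[2]
    return summary


def mosdepth_mean_coverage(lines):
    """Map each row of a mosdepth summary (by its chrom column) to mean coverage."""
    rows = csv.DictReader(lines, delimiter="\t")
    return {row["chrom"]: float(row["mean"]) for row in rows}


# Usage (the paths are hypothetical examples):
# with open("alignment/samtools_stats/sample1.stats") as handle:
#     print(parse_samtools_sn(handle).get("reads mapped"))
# with open("alignment/mosdepth/sample1.mosdepth.summary.txt") as handle:
#     print(mosdepth_mean_coverage(handle))
```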
Preprocessing
GATK Preprocessing
Output files
preprocessing/markduplicates/
- *.bam: BAM files with duplicates marked.
- *.bai: BAM index files.
- *.metrics: Duplicate marking metrics.
preprocessing/splitncigarreads/ (RNA samples only)
- *.bam: BAM files with split N CIGAR reads.
- *.bai: BAM index files.
preprocessing/baserecalibrator/
- *.table: Base quality score recalibration tables.
preprocessing/applybqsr/
- *.bam: BAM files with recalibrated base quality scores.
- *.bai: BAM index files.
The pipeline follows GATK best practices for preprocessing:
- MarkDuplicates: Identifies and marks duplicate reads
- SplitNCigarReads: Splits reads with N CIGAR operations (RNA-seq specific)
- BaseRecalibrator: Generates base quality score recalibration table
- ApplyBQSR: Applies base quality score recalibration
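The order of the steps above, with SplitNCigarReads applied to RNA samples only, can be sketched as follows. The function preprocessing_steps is a hypothetical illustration of the ordering, not the pipeline's actual implementation.

```python
def preprocessing_steps(is_rna):
    """Return the GATK preprocessing steps in the order listed above.

    SplitNCigarReads is included only for RNA samples.
    """
    steps = ["MarkDuplicates"]
    if is_rna:
        steps.append("SplitNCigarReads")
    steps += ["BaseRecalibrator", "ApplyBQSR"]
    return steps


# Usage:
# for sample_type, is_rna in (("DNA", False), ("RNA", True)):
#     print(sample_type, " -> ".join(preprocessing_steps(is_rna)))
```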
Variant Calling
Mutect2
Output files
variant_calling/mutect2/
- *.vcf.gz: Variant calls in VCF format.
- *.vcf.gz.tbi: VCF index files.
- *.stats: Mutect2 statistics.
- *.f1r2.tar.gz: F1R2 counts for orientation bias filtering.
Mutect2 is GATK’s somatic variant caller for detecting SNVs and indels in tumor samples.
Strelka2
Output files
variant_calling/strelka/
- *.somatic.snvs.vcf.gz: Somatic SNV calls.
- *.somatic.indels.vcf.gz: Somatic indel calls.
- *.vcf.gz.tbi: VCF index files.
Strelka2 is a fast and accurate small variant caller optimized for analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs.
SAGE
Output files
variant_calling/sage/
- *.vcf.gz: Variant calls in VCF format.
- *.vcf.gz.tbi: VCF index files.
SAGE is a precise and highly sensitive somatic SNV, MNV and INDEL caller.
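All three callers write compressed VCF files, so a quick sanity check is to tally records by their FILTER value. The sketch below uses only the standard library; the function name count_by_filter and the example path are hypothetical, and it relies only on FILTER being the seventh tab-separated column, as defined by the VCF format.

```python
import gzip
from collections import Counter


def count_by_filter(lines):
    """Count VCF records per FILTER value (for example PASS)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1  # column 7 is FILTER
    return counts


# Usage (the path is a hypothetical example):
# with gzip.open("variant_calling/mutect2/sample1.vcf.gz", "rt") as handle:
#     print(count_by_filter(handle))
```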
Annotation
VEP Annotation
Output files
annotation/vep/
- *.vcf.gz: Annotated variants in VCF format.
- *.vcf.gz.tbi: VCF index files.
- *.html: VEP summary report.
Variant Effect Predictor (VEP) determines the effect of variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.
VCF to MAF Conversion
Output files
annotation/vcf2maf/
- *.maf: Variants in Mutation Annotation Format (MAF).
Variants are converted from VCF to MAF format using vcf2maf for downstream analysis and visualization.
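MAF files are tab-delimited text, optionally starting with comment lines such as a version header. A minimal reader might look like the sketch below; the function names and example path are hypothetical, and Variant_Classification is a standard MAF column that is assumed to be present in these files.

```python
import csv
from collections import Counter


def read_maf(lines):
    """Parse MAF text into a list of dicts, skipping '#' comment lines."""
    data_lines = (line for line in lines if not line.startswith("#"))
    # QUOTE_NONE keeps any quote characters inside fields untouched.
    return list(csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE))


def count_classifications(rows):
    """Count variants per Variant_Classification value."""
    return Counter(row["Variant_Classification"] for row in rows)


# Usage (the path is a hypothetical example):
# with open("annotation/vcf2maf/sample1.maf") as handle:
#     print(count_classifications(read_maf(handle)))
```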
Consensus
Consensus Calling
Output files
consensus/
- *.consensus.vcf: Consensus variants across all callers in VCF format.
- *.consensus.maf: Consensus variants in MAF format.
- *.consensus_*.vcf: Caller-specific consensus variants.
- *.consensus_*.maf: Caller-specific consensus variants in MAF format.
- *.pdf: Consensus analysis plots and visualizations.
The consensus module combines variant calls from multiple callers (Mutect2, Strelka2, SAGE) to improve variant calling accuracy and reduce false positives.
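The general idea of consensus calling can be sketched as a vote over callers. This is not the module's actual algorithm: the function consensus_calls, the variant key (chromosome, position, ref, alt) and the default min_callers=2 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def consensus_calls(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` callers.

    `calls_by_caller` maps a caller name to a set of variant keys,
    here (chrom, pos, ref, alt) tuples. Returns a dict mapping each
    retained variant to the sorted list of callers supporting it.
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for variant in variants:
            support[variant].add(caller)
    return {
        variant: sorted(callers)
        for variant, callers in support.items()
        if len(callers) >= min_callers
    }
```

Raising min_callers trades sensitivity for specificity: requiring agreement from more callers removes more caller-specific false positives but also drops true variants that only one caller detects.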
Filtering
MAF Filtering
Output files
filtering/
- *.filtered.maf: Filtered variants in MAF format.
Variants are filtered based on various criteria including:
- Population frequency (gnomAD), controlled by the config of the maf_filtering module
- Whitelist/blacklist regions, controlled by the whitelist and blacklist parameters
- Quality metrics
- Custom filtering parameters, controlled by the config of the maf_filtering module
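To make the criteria above concrete, here is a hedged sketch of a per-variant filter. It does not reproduce the maf_filtering module: the function names, the gnomAD_AF column name, the 0.01 frequency threshold, and the rule that whitelisted regions are always kept while blacklisted regions are always dropped are all assumptions made for illustration. Chromosome and Start_Position are standard MAF columns.

```python
def in_regions(chrom, pos, regions):
    """True if (chrom, pos) lies in any (chrom, start, end) region, inclusive."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


def keep_variant(row, max_pop_af=0.01, whitelist=(), blacklist=()):
    """Decide whether a MAF row passes the illustrative filters.

    Assumed precedence: whitelist keeps, then blacklist drops, then the
    population frequency is compared with max_pop_af.
    """
    chrom, pos = row["Chromosome"], int(row["Start_Position"])
    if in_regions(chrom, pos, whitelist):
        return True
    if in_regions(chrom, pos, blacklist):
        return False
    # A missing or empty frequency is treated as 0 (variant absent from gnomAD).
    pop_af = float(row.get("gnomAD_AF") or "0")
    return pop_af <= max_pop_af
```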
RNA-specific Filtering
Output files
filtering/rna/
- *.rna_filtered.maf: RNA-specific filtered variants in MAF format.
Additional filtering steps specific to RNA-seq data account for RNA editing, splicing artifacts, and other RNA-specific noise.
Realignment
RNA Realignment
Output files
realignment/
- *.realigned.bam: Realigned BAM files for variant regions.
- *.realigned.bai: BAM index files.
- *.realigned.maf: Variants from realigned regions.
This optional step uses HISAT2 to realign RNA samples in regions where variants were detected. It helps improve variant calling accuracy in RNA-seq data by addressing alignment artifacts.
Normalization
VT Normalization
Output files
normalization/
- *.normalized.vcf.gz: Normalized variants in VCF format.
- *.stats: Normalization statistics.
Variants are normalized using vt decompose and vt normalize to ensure consistent representation of variants across different callers.
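The sketch below illustrates why normalization matters, using a simplified stand-in rather than vt itself. The function decompose_and_trim is hypothetical: it splits a multi-allelic record into one record per ALT allele and trims bases shared by REF and ALT. Left-alignment against the reference genome, which a full normalization also performs, is deliberately omitted.

```python
def decompose_and_trim(chrom, pos, ref, alts):
    """Split a multi-allelic record and trim shared flanking bases.

    `alts` is the comma-separated ALT field. Returns one
    (chrom, pos, ref, alt) tuple per ALT allele.
    """
    records = []
    for alt in alts.split(","):
        r, a, p = ref, alt, pos
        # Trim shared trailing bases, keeping at least one base per allele.
        while len(r) > 1 and len(a) > 1 and r[-1] == a[-1]:
            r, a = r[:-1], a[:-1]
        # Trim shared leading bases and shift the position accordingly.
        while len(r) > 1 and len(a) > 1 and r[0] == a[0]:
            r, a, p = r[1:], a[1:], p + 1
        records.append((chrom, p, r, a))
    return records
```

With this representation, a substitution written as CAG>CTG by one caller and as A>T one base further along by another reduces to the same record, so the two calls can be matched.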
MultiQC
Output files
multiqc/
- multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
- multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc_plots/: directory containing static images from the report in various formats.
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
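The parsed statistics in multiqc_data/ can also be read programmatically. The sketch below assumes the directory contains a tab-delimited multiqc_general_stats.txt file whose first column is named Sample, which is what MultiQC typically writes; the function name read_general_stats is hypothetical, and the available metric columns depend on which tools were run.

```python
import csv


def read_general_stats(lines):
    """Map each sample name to its general-statistics columns."""
    reader = csv.DictReader(lines, delimiter="\t")
    return {
        row["Sample"]: {key: value for key, value in row.items() if key != "Sample"}
        for row in reader
    }


# Usage (the path is an assumption about the MultiQC output layout):
# with open("multiqc/multiqc_data/multiqc_general_stats.txt") as handle:
#     for sample, metrics in read_general_stats(handle).items():
#         print(sample, metrics)
```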
Pipeline information
Output files
pipeline_info/
- execution_report_*.html: Report describing the pipeline run.
- execution_timeline_*.html: Timeline of the pipeline execution.
- execution_trace_*.txt: Trace file with detailed execution information.
- pipeline_dag_*.html: DAG visualization of the pipeline.
- software_versions.yml: Software versions used in the pipeline.
- params_*.json: Parameters used for the pipeline run.
Nextflow can generate several reports about the running and execution of the pipeline. These reports help you troubleshoot pipeline errors and provide other information such as launch commands, run times and resource usage.
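When troubleshooting, a quick first step is to count tasks by status in the trace file. The sketch below assumes the trace is tab-delimited with a status column, as in Nextflow's default trace fields; the function name count_task_status and the example file name are hypothetical.

```python
import csv
from collections import Counter


def count_task_status(lines):
    """Count tasks per status (for example COMPLETED or FAILED) in a trace file."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["status"] for row in reader)


# Usage (the file name is a hypothetical example matching execution_trace_*.txt):
# with open("pipeline_info/execution_trace_example.txt") as handle:
#     print(count_task_status(handle))
```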