nf-core/rnadnavar

Pipeline for RNA and DNA integrated analysis for somatic mutation detection

This is the development version of the pipeline.

Launch development version https://github.com/nf-core/rnadnavar

Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarizes results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

Quality Control - Raw read QC and trimming, FastQC and fastp reports for input data assessment
Alignment - Read alignment to reference genome (BAM/CRAM)
Preprocessing - GATK preprocessing steps following best practices
Variant Calling - Somatic variant detection with multiple callers (Mutect2, Strelka2, SAGE), producing VCF files
Annotation - Variant annotation with VEP
Consensus - Consensus variant calling across multiple callers, in MAF format
Filtering - Variant filtering and quality control in MAF format
Realignment - Optional RNA-specific realignment
MultiQC - Aggregate report describing results and QC
Pipeline information - Report metrics generated during the workflow execution

Quality Control

FastQC

Output files

fastqc/
- *_fastqc.html: FastQC report containing quality metrics.
- *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of low quality.
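
As a rough illustration of how the zip archive can be inspected programmatically, the sketch below reads the per-module PASS/WARN/FAIL statuses from the `summary.txt` file inside a FastQC archive. It assumes the usual FastQC layout (one top-level folder containing `summary.txt` with tab-separated status, module name and file name); the function names and the in-memory demo archive are hypothetical and not part of the pipeline.

```python
import io
import zipfile


def read_fastqc_summary(zip_source):
    """Return {module name: status} from the summary.txt inside a FastQC zip."""
    with zipfile.ZipFile(zip_source) as archive:
        # Assumption: the archive holds one folder with summary.txt inside it.
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        text = archive.read(name).decode("utf-8")
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


def make_demo_zip():
    """Build a tiny in-memory archive that mimics the assumed FastQC layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "sample_fastqc/summary.txt",
            "PASS\tBasic Statistics\tsample.fastq.gz\n"
            "WARN\tAdapter Content\tsample.fastq.gz\n",
        )
    buffer.seek(0)
    return buffer


demo_summary = read_fastqc_summary(make_demo_zip())
```

In practice `zip_source` would be the path to one of the `*_fastqc.zip` files listed above.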

fastp

Output files

fastp/
- *.fastp.html: fastp report in HTML format.
- *.fastp.json: fastp report in JSON format.
- *.fastp.log: fastp log file.
- *.fastp.fastq.gz: Trimmed FastQ files (if trimming is enabled).

fastp is a tool designed to provide fast, all-in-one preprocessing for FastQ files. It is written in C++ with multithreading support for high performance. In this pipeline, fastp performs quality control and adapter trimming when --trim_fastq is specified.
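
The JSON report lends itself to quick scripted checks. The sketch below computes the fraction of reads that survive filtering, assuming the report contains `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads` as in typical fastp output; the function name and the trimmed-down example report are hypothetical.

```python
import json


def fastp_read_retention(report):
    """Fraction of reads kept by fastp, given a parsed *.fastp.json report."""
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    # Guard against an empty input file.
    return after / before if before else 0.0


# Trimmed-down stand-in for a real report; the numbers are made up.
demo_report = json.loads(
    '{"summary": {"before_filtering": {"total_reads": 1000},'
    ' "after_filtering": {"total_reads": 950}}}'
)
demo_retention = fastp_read_retention(demo_report)
```

For a real run, the report would be loaded with `json.load` from one of the `*.fastp.json` files.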

Alignment

BWA-MEM / BWA-MEM2 / STAR

Output files

alignment/
- *.bam: Aligned reads in BAM format.
- *.bai: BAM index files.
- *.cram: Aligned reads in CRAM format (if --save_output_as_bam is false).
- *.crai: CRAM index files.

The pipeline supports multiple aligners:

BWA-MEM for DNA alignment
BWA-MEM2 for faster DNA alignment
STAR for RNA alignment
DRAGMAP for high-accuracy alignment

Alignment files are saved in BAM or CRAM format depending on the --save_output_as_bam parameter.

Alignment QC

Output files

alignment/samtools_stats/
- *.stats: Samtools stats output with alignment statistics.
alignment/mosdepth/
- *.mosdepth.global.dist.txt: Global coverage distribution.
- *.mosdepth.summary.txt: Coverage summary statistics.

Quality control metrics for aligned reads are generated using samtools stats and mosdepth.
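
Both outputs are plain tab-delimited text, so they can be read without extra dependencies. The following sketch collects the `SN` (summary numbers) lines of a samtools stats file and the per-contig mean depth from a mosdepth summary. It assumes the usual layouts (`SN<TAB>key:<TAB>value` and a header containing `chrom` and `mean`); the function names and demo text are hypothetical.

```python
def parse_samtools_summary(text):
    """Collect the 'SN' lines of samtools stats output into {key: value}."""
    summary = {}
    for line in text.splitlines():
        if not line.startswith("SN\t"):
            continue
        fields = line.split("\t")
        # Keys end with a colon, e.g. "reads mapped:".
        summary[fields[1].rstrip(":")] = float(fields[2])
    return summary


def mosdepth_mean_coverage(text):
    """Return {contig: mean depth} from a mosdepth summary table."""
    rows = [r.split("\t") for r in text.strip().splitlines()]
    header, records = rows[0], rows[1:]
    chrom_idx, mean_idx = header.index("chrom"), header.index("mean")
    return {r[chrom_idx]: float(r[mean_idx]) for r in records}


# Made-up excerpts that follow the assumed layouts.
demo_stats = (
    "# This file was produced by samtools stats\n"
    "SN\traw total sequences:\t2000\n"
    "SN\treads mapped:\t1900\n"
    "FFQ\t1\t0\t0\n"
)
demo_mosdepth = (
    "chrom\tlength\tbases\tmean\tmin\tmax\n"
    "chr1\t1000\t30000\t30.00\t0\t80\n"
    "total\t1000\t30000\t30.00\t0\t80\n"
)
```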

Preprocessing

GATK Preprocessing

Output files

preprocessing/
- markduplicates/
  - *.bam: BAM files with duplicates marked.
  - *.bai: BAM index files.
  - *.metrics: Duplicate marking metrics.
- splitncigarreads/ (RNA samples only)
  - *.bam: BAM files with split N CIGAR reads.
  - *.bai: BAM index files.
- baserecalibrator/
  - *.table: Base quality score recalibration tables.
- applybqsr/
  - *.bam: BAM files with recalibrated base quality scores.
  - *.bai: BAM index files.

The pipeline follows GATK best practices for preprocessing:

MarkDuplicates: Identifies and marks duplicate reads
SplitNCigarReads: Splits reads with N CIGAR operations (RNA-seq specific)
BaseRecalibrator: Generates base quality score recalibration table
ApplyBQSR: Applies base quality score recalibration
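
The order of these steps, and the fact that SplitNCigarReads applies to RNA samples only, can be summarised in a few lines. This is only a schematic of the list above, not the pipeline's own code; the `"dna"`/`"rna"` labels and the function name are hypothetical.

```python
def preprocessing_steps(sample_type):
    """Return the GATK preprocessing steps, in order, for a sample type."""
    steps = ["MarkDuplicates"]
    if sample_type == "rna":
        # Reads spanning splice junctions carry N operations in their CIGAR.
        steps.append("SplitNCigarReads")
    steps += ["BaseRecalibrator", "ApplyBQSR"]
    return steps
```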

Variant Calling

Mutect2

Output files

variant_calling/mutect2/
- *.vcf.gz: Variant calls in VCF format.
- *.vcf.gz.tbi: VCF index files.
- *.stats: Mutect2 statistics.
- *.f1r2.tar.gz: F1R2 counts for orientation bias filtering.

Mutect2 is GATK’s somatic variant caller for detecting SNVs and indels in tumor samples.

Strelka2

Output files

variant_calling/strelka/
- *.somatic.snvs.vcf.gz: Somatic SNV calls.
- *.somatic.indels.vcf.gz: Somatic indel calls.
- *.vcf.gz.tbi: VCF index files.

Strelka2 is a fast and accurate small variant caller optimized for analysis of germline variation in small cohorts and somatic variation in tumor/normal sample pairs.

SAGE

Output files

variant_calling/sage/
- *.vcf.gz: Variant calls in VCF format.
- *.vcf.gz.tbi: VCF index files.

SAGE is a precise and highly sensitive somatic SNV, MNV and INDEL caller.
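
The compressed VCF files produced by any of the callers above can be given a quick sanity check by tallying the FILTER column. The sketch below uses only the standard library; the function name and the demo records are hypothetical, and which FILTER values actually appear depends on the caller and on any filtering applied upstream.

```python
import gzip
import io
from collections import Counter


def tally_filters(lines):
    """Count VCF records per FILTER value (column 7), ignoring header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[line.rstrip("\n").split("\t")[6]] += 1
    return counts


# Made-up records, compressed in memory so the example needs no files.
demo_vcf = gzip.compress(
    b"##fileformat=VCFv4.2\n"
    b"#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    b"chr1\t100\t.\tA\tT\t.\tPASS\t.\n"
    b"chr1\t200\t.\tG\tC\t.\tweak_evidence\t.\n"
    b"chr2\t300\t.\tC\tCT\t.\tPASS\t.\n"
)
with gzip.open(io.BytesIO(demo_vcf), "rt") as handle:
    demo_counts = tally_filters(handle)
```

For a file on disk, `gzip.open(path, "rt")` can be passed to `tally_filters` in the same way.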

Annotation

VEP Annotation

Output files

annotation/vep/
- *.vcf.gz: Annotated variants in VCF format.
- *.vcf.gz.tbi: VCF index files.
- *.html: VEP summary report.

Variant Effect Predictor (VEP) determines the effect of variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.

VCF to MAF Conversion

Output files

annotation/vcf2maf/
- *.maf: Variants in Mutation Annotation Format (MAF).

Variants are converted from VCF to MAF format using vcf2maf for downstream analysis and visualization.
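
For downstream scripting, a MAF file can be read as a tab-delimited table once the leading `#` comment line (typically `#version ...`) is skipped. The sketch below counts variants per `Variant_Classification`, a standard MAF column; the function name and the demo rows are hypothetical.

```python
import csv
import io
from collections import Counter


def read_maf(handle):
    """Yield one dict per variant from a MAF, skipping '#' comment lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    # MAF fields are not quoted, so quote characters are treated literally.
    yield from csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)


# Made-up rows with only a few of the MAF columns.
demo_maf = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tVariant_Classification\n"
    "GENE1\tchr1\t100\tMissense_Mutation\n"
    "GENE2\tchr2\t200\tSilent\n"
    "GENE3\tchr3\t300\tMissense_Mutation\n"
)
demo_classes = Counter(row["Variant_Classification"] for row in read_maf(demo_maf))
```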

Consensus

Consensus Calling

Output files

consensus/
- *.consensus.vcf: Consensus variants across all callers in VCF format.
- *.consensus.maf: Consensus variants in MAF format.
- *.consensus_*.vcf: Caller-specific consensus variants.
- *.consensus_*.maf: Caller-specific consensus variants in MAF format.
- *.pdf: Consensus analysis plots and visualizations.

The consensus module combines variant calls from multiple callers (Mutect2, Strelka2, SAGE) to improve variant calling accuracy and reduce false positives.
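
The general idea of a multi-caller consensus can be sketched as keeping variants reported by at least a minimum number of callers. This is an illustration of the concept only: the rule the pipeline actually applies is not described here, and the `min_callers` threshold, the function name and the demo calls are assumptions.

```python
from collections import defaultdict


def consensus_variants(calls_by_caller, min_callers=2):
    """Return {variant: sorted caller names} for variants with enough support.

    Each variant is keyed by (chrom, pos, ref, alt).
    """
    support = defaultdict(set)
    for caller, variants in calls_by_caller.items():
        for key in variants:
            support[key].add(caller)
    return {
        key: sorted(callers)
        for key, callers in support.items()
        if len(callers) >= min_callers
    }


# Made-up calls: only the first variant is seen by more than one caller.
demo_calls = {
    "mutect2": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")],
    "strelka2": [("chr1", 100, "A", "T")],
    "sage": [("chr1", 100, "A", "T"), ("chr3", 300, "C", "CT")],
}
demo_consensus = consensus_variants(demo_calls)
```

Keying on position and alleles only works if all callers represent the same variant identically, which is why consistent variant representation across callers matters.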

Filtering

MAF Filtering

Output files

filtering/
- *.filtered.maf: Filtered variants in MAF format.

Variants are filtered based on various criteria including:

Population frequency (gnomAD) (controlled by config of maf_filtering module)
Whitelist/blacklist regions (controlled by parameters whitelist and blacklist)
Quality metrics
Custom filtering parameters (controlled by config of maf_filtering module)
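
To make two of the criteria above concrete, the sketch below drops variants that are common in the population or that fall inside a blacklisted region. It is not the maf_filtering module itself: the `gnomAD_AF` column name, the 0.001 threshold, the region format and the function names are all assumptions chosen for illustration.

```python
def in_regions(chrom, pos, regions):
    """True if pos lies within any (chrom, start, end) region, ends inclusive."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


def passes_filters(variant, max_pop_af=0.001, blacklist=()):
    """Keep a MAF row unless it is common in gnomAD or in a blacklisted region."""
    af_text = variant.get("gnomAD_AF", "")
    # Assumption: an empty frequency means the variant is absent from gnomAD.
    pop_af = float(af_text) if af_text else 0.0
    if pop_af > max_pop_af:
        return False
    return not in_regions(
        variant["Chromosome"], int(variant["Start_Position"]), blacklist
    )


# Made-up rows: one common variant, one blacklisted, one that survives.
demo_variants = [
    {"Chromosome": "chr1", "Start_Position": "100", "gnomAD_AF": "0.2"},
    {"Chromosome": "chr1", "Start_Position": "500", "gnomAD_AF": ""},
    {"Chromosome": "chr2", "Start_Position": "900", "gnomAD_AF": "0.00001"},
]
demo_blacklist = [("chr1", 400, 600)]
demo_kept = [v for v in demo_variants if passes_filters(v, blacklist=demo_blacklist)]
```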

RNA-specific Filtering

Output files

filtering/rna/
- *.rna_filtered.maf: RNA-specific filtered variants in MAF format.

Additional filtering steps specific to RNA-seq data account for RNA editing, splicing artifacts, and other RNA-specific noise.

Realignment

RNA Realignment

Output files

realignment/
- *.realigned.bam: Realigned BAM files for variant regions.
- *.realigned.bai: BAM index files.
- *.realigned.maf: Variants from realigned regions.

An optional realignment step uses HISAT2 to realign RNA samples in regions where variants were detected. This helps improve variant calling accuracy in RNA-seq data by addressing alignment artifacts.

Normalization

VT Normalization

Output files

normalization/
- *.normalized.vcf.gz: Normalized variants in VCF format.
- *.stats: Normalization statistics.

Variants are normalized using vt decompose and vt normalize so that the same variant is represented consistently across different callers.
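
To show what decomposition and normalization mean in practice, the sketch below splits a multi-allelic record into biallelic ones and trims bases shared by the reference and alternate alleles. It is a simplified illustration and not vt's implementation: in particular, it does not left-align indels against the reference genome, which full normalization also requires. The function names are hypothetical.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by ref and alt, keeping at least one base in each."""
    # Drop shared trailing bases first.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases, shifting the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def decompose(chrom, pos, ref, alts):
    """Split a comma-separated ALT field into trimmed biallelic variants."""
    return [(chrom, *trim_alleles(pos, ref, alt)) for alt in alts.split(",")]


# A record with two alternate alleles: an SNV written with padding, and a deletion.
demo_records = decompose("chr1", 100, "CAG", "CTG,C")
```

Here the padded SNV `CAG>CTG` at position 100 reduces to `A>T` at position 101, while the deletion `CAG>C` is already minimal and is left unchanged.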

MultiQC

Output files

multiqc/
- multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
- multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools, e.g. FastQC. The pipeline also includes steps that report software versions in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files

pipeline_info/
- execution_report_*.html: Report describing the pipeline run.
- execution_timeline_*.html: Timeline of the pipeline execution.
- execution_trace_*.txt: Trace file with detailed execution information.
- pipeline_dag_*.html: DAG visualization of the pipeline.
- software_versions.yml: Software versions used in the pipeline.
- params_*.json: Parameters used for the pipeline run.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
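
When troubleshooting, the trace file is often the quickest place to look, since it is a tab-delimited table with one row per task. The sketch below tallies task statuses and lists failed tasks, assuming the trace contains the `name` and `status` columns that Nextflow writes by default; the function name and the process names in the demo are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_trace(handle):
    """Return (status counts, names of failed tasks) from a Nextflow trace."""
    statuses = Counter()
    failed = []
    for row in csv.DictReader(handle, delimiter="\t"):
        statuses[row["status"]] += 1
        if row["status"] == "FAILED":
            failed.append(row["name"])
    return statuses, failed


# Made-up trace excerpt with a subset of the usual columns.
demo_trace = io.StringIO(
    "task_id\tname\tstatus\texit\n"
    "1\tFASTQC (sample1)\tCOMPLETED\t0\n"
    "2\tFASTP (sample1)\tCOMPLETED\t0\n"
    "3\tALIGN (sample1)\tFAILED\t1\n"
)
demo_statuses, demo_failed = summarize_trace(demo_trace)
```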
