CPG Bioinformatics Nepal Training Feb 2026


Hands-on training for analysing Oxford Nanopore Technologies (ONT) sequencing data for microbial pathogen detection

Tutorial 1: Quality control of Nanopore Reads, processing of reads, and taxonomic profiling

Learn to assess and visualise the quality of Oxford Nanopore sequencing data using FastQC, MultiQC, and NanoPlot; perform adapter trimming and quality filtering with Porechop and Fastp; and profile the taxonomic composition of your samples with Kraken2 and Krona.

Prerequisites

Learning objectives

By the end of this tutorial, you will:


Step 1: Verify your data and prepare the sample paths file

If you have not downloaded the data yet, please follow the Data download guide before proceeding.

# Navigate to project directory
cd ~/nanopore_training

# Check your data files
ls -lh data/raw_reads/

# rename the data files
mv data/raw_reads/Barcode10_Spike2.fastq.gz data/raw_reads/barcode10.fastq.gz
mv data/raw_reads/Barcode11_Spike2b.fastq.gz data/raw_reads/barcode11.fastq.gz

# Create the tab-separated sample paths file
# MANUAL WAY: list the full paths, copy them into Excel (or any
# spreadsheet/editor), and build two tab-separated columns: sample name and path
ls $PWD/data/raw_reads/*.fastq.gz

# then open an empty file and paste the two-column content into it
nano reads_paths.tab

# EASY WAY: create the sample paths file using command line

ls $PWD/data/raw_reads/*.fastq.gz | awk -F'/' '{ split($NF, a, "."); print a[1] "\t" $0 }' > reads_paths.tab

# Verify reads_paths.tab exists
cat reads_paths.tab
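Before moving on, it is worth confirming that the file really is two tab-separated columns and that every listed FASTQ exists. A small sketch (using the same `reads_paths.tab` produced above):

```shell
# Sanity-check reads_paths.tab: every line must have exactly two
# tab-separated fields, and the file in column 2 must exist
awk -F'\t' 'NF != 2 { print "Malformed line " NR ": " $0; bad = 1 }
            END { exit bad }' reads_paths.tab || echo "Fix reads_paths.tab first"

while IFS=$'\t' read -r sample path; do
  [ -f "$path" ] || echo "Missing FASTQ for sample $sample: $path"
done < reads_paths.tab
```

If either check prints anything, correct the file before running the QC steps below.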

Step 2: FastQC - Standard Quality Metrics

What is FastQC?

Run FastQC

# Create output directory
mkdir -p analysis/qc/fastqc_raw/

# Run one at a time first

fastqc data/raw_reads/barcode10.fastq.gz -o analysis/qc/fastqc_raw/ -t 8
fastqc data/raw_reads/barcode11.fastq.gz -o analysis/qc/fastqc_raw/ -t 8


# Run FastQC on both samples with GNU parallel (-j 1 runs one job at a time;
# raise it to run samples concurrently if you have the resources)
cat reads_paths.tab | parallel -j 1 --colsep '\t' 'fastqc {2} -o analysis/qc/fastqc_raw -t 8'
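If GNU parallel is not installed, the same loop can be written in plain bash. In this sketch the leading `echo` only prints each constructed command (a dry run); remove it to actually run FastQC:

```shell
# Plain-bash equivalent of the parallel command: read the two
# tab-separated columns of reads_paths.tab and build one fastqc
# command per sample ("echo" shows the command instead of running it)
while IFS=$'\t' read -r sample fastq; do
  echo fastqc "$fastq" -o analysis/qc/fastqc_raw -t 8
done < reads_paths.tab
```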

Command explanation:

View FastQC results

# List generated files
ls -lh analysis/qc/fastqc_raw/

# Open HTML report in browser (WSL/Linux)
firefox analysis/qc/fastqc_raw/barcode10_fastqc.html &
# Or for WSL: explorer.exe analysis/qc/fastqc_raw/barcode10_fastqc.html

Key metrics to check:

For nanopore data:


Step 3: MultiQC - aggregate reports

What is MultiQC?

Run MultiQC

# Aggregate all FastQC outputs
cd analysis/qc/
multiqc fastqc_raw/ -o multiqc_raw/


# Return to project root
cd ~/nanopore_training

Command explanation:

View MultiQC report

# Open MultiQC report
firefox analysis/qc/multiqc_raw/multiqc_report.html &

What to look for:


Step 4 [OPTIONAL - Only if you have time]: NanoPlot - Nanopore-Specific QC

What is NanoPlot?

Run NanoPlot

# Create output directory
mkdir -p analysis/qc/nanoplot_raw/

# Run NanoPlot on each sample
cat reads_paths.tab \
  | parallel -j 1 --colsep '\t' \
    'NanoPlot \
       --fastq {2} \
       --outdir analysis/qc/nanoplot_raw/{1} \
       --prefix {1}_ \
       --threads 8 \
       --plots dot kde'

Command explanation:

View NanoPlot Results

# Check generated files for one sample
ls -lh analysis/qc/nanoplot_raw/barcode10/

# Open NanoPlot HTML report
firefox analysis/qc/nanoplot_raw/barcode10/barcode10_NanoPlot-report.html &

Key NanoPlot metrics:

Metric            | What it means                        | Good values (for bacteria)
------------------|--------------------------------------|---------------------------
Number of reads   | Total read count                     | 50,000+
Mean read length  | Average read size                    | 2,000-5,000 bp
Mean read quality | Average Q score                      | Q10+ acceptable, Q12+ good
N50               | 50% of bases in reads ≥ this length  | 3,000+ bp
Total bases       | Total sequencing output              | 200 Mb+
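If you skip NanoPlot, the same headline numbers can be computed directly from a FASTQ with awk. A sketch (the input path assumes the renamed files from Step 1):

```shell
# Read count, total bases, and N50 straight from a gzipped FASTQ:
# print the length of every sequence line (every 4th line, offset 2),
# sort descending, then walk until half the total bases are covered
zcat data/raw_reads/barcode10.fastq.gz \
  | awk 'NR % 4 == 2 { print length($0) }' \
  | sort -nr \
  | awk '{ len[NR] = $1; total += $1 }
         END {
           print "Reads:", NR
           print "Total bases:", total
           run = 0
           for (i = 1; i <= NR; i++) {
             run += len[i]
             if (run >= total / 2) { print "N50:", len[i]; break }
           }
         }'
```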

Step 5: Porechop - adapter trimming [Takes time]

What is Porechop?

Why remove adapters?

Run Porechop

# Create output directory
mkdir -p analysis/qc/porechop/

# Run one sample first to understand the process
porechop -i data/raw_reads/barcode10.fastq.gz -o analysis/qc/porechop/barcode10_trimmed.fastq.gz -t 8
porechop -i data/raw_reads/barcode11.fastq.gz -o analysis/qc/porechop/barcode11_trimmed.fastq.gz -t 8


# Run Porechop on all samples in parallel
# cat reads_paths.tab | parallel -j 1 --colsep '\t' 'porechop -i {2} -o analysis/qc/porechop/{1}_trimmed.fastq.gz -t 8'
# parallel --progress : Show progress bar (optional, may slow down processing)
# parallel --line-buffer : Buffer output line by line (optional, may slow down processing)

Command explanation:

porechop \
  -i input.fastq.gz           # Input FASTQ file
  -o output_trimmed.fastq.gz  # Output file (automatically compressed)
  -t 8                        # Number of threads

Default Porechop behavior:

Check Porechop Output

# View Porechop statistics (printed to terminal)
# Look for lines like:
# "Trimming adapters from read ends"
# "Splitting reads with internal adapters"
# "X reads had adapters trimmed from their start"
# "Y reads had adapters trimmed from their end"

# Count reads before and after
echo "=== Porechop Results ==="
for sample in barcode10 barcode11; do
  echo "Sample: $sample"
  raw=$(( $(zcat data/raw_reads/${sample}.fastq.gz | wc -l) / 4 ))
  trimmed=$(( $(zcat analysis/qc/porechop/${sample}_trimmed.fastq.gz | wc -l) / 4 ))
  echo "  Raw reads: $raw"
  echo "  Trimmed reads: $trimmed"
  echo "  Retained: $(echo "scale=1; $trimmed*100/$raw" | bc)%"
  echo ""
done

# List trimmed files
ls -lh analysis/qc/porechop/

# Compare file sizes (trimmed should be slightly smaller)
echo "=== File size comparison ==="
echo "Raw reads:"
du -h data/raw_reads/*.fastq.gz
echo ""
echo "Trimmed reads:"
du -h analysis/qc/porechop/*.fastq.gz

What to expect:


Step 6: Fastp - Quality filtering and length filtering

What is Fastp?

Why quality filter?

Run Fastp

# Create output directory
mkdir -p analysis/qc/fastp/

# Run one sample first
fastp \
  -i analysis/qc/porechop/barcode10_trimmed.fastq.gz \
  -o analysis/qc/fastp/barcode10_filtered.fastq.gz \
  --qualified_quality_phred 10 \
  --disable_trim_poly_g \
  --thread 8 \
  --html analysis/qc/fastp/barcode10_fastp.html \
  --json analysis/qc/fastp/barcode10_fastp.json

# Run on all samples in parallel
cat reads_paths.tab \
  | parallel -j 1 --colsep '\t' \
    'fastp \
       -i analysis/qc/porechop/{1}_trimmed.fastq.gz \
       -o analysis/qc/fastp/{1}_filtered.fastq.gz \
       --qualified_quality_phred 10 \
       --disable_trim_poly_g \
       --thread 8 \
       --html analysis/qc/fastp/{1}_fastp.html \
       --json analysis/qc/fastp/{1}_fastp.json'

Command explanation:

fastp \
  -i input.fastq.gz                    # Input file
  -o output_filtered.fastq.gz          # Output file
  --qualified_quality_phred 10         # Minimum quality score (Q10)
  --disable_trim_poly_g                # Disable poly-G trimming (not needed for nanopore)
  --thread 8                           # Number of threads
  --html report.html                   # HTML report
  --json report.json                   # JSON report (for MultiQC)

Adjust parameters based on your needs:

View Fastp Reports

# Open HTML report in browser
firefox analysis/qc/fastp/barcode10_fastp.html &

# Or list all reports
ls -lh analysis/qc/fastp/*.html

What to look for in Fastp report:

Before filtering section:

After filtering section:

Filtering result:

Check Fastp Statistics

echo "=== Fastp Filtering Results ==="
for sample in barcode10 barcode11; do
  echo "Sample: $sample"
  
  # Extract read counts from the fastp JSON
  # (first "total_reads" match = before filtering, last = after filtering)
  total_reads=$(grep '"total_reads"' analysis/qc/fastp/${sample}_fastp.json | head -1 | awk -F':' '{print $2}' | tr -d ' ,')
  filtered_reads=$(grep '"total_reads"' analysis/qc/fastp/${sample}_fastp.json | tail -1 | awk -F':' '{print $2}' | tr -d ' ,')
  
  echo "  Before filtering: $total_reads reads"
  echo "  After filtering: $filtered_reads reads"
  echo "  Retained: $(echo "scale=1; $filtered_reads*100/$total_reads" | bc)%"
  echo ""
done
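The grep approach depends on how the JSON is laid out line by line. A more robust sketch parses the report as JSON (this assumes fastp's standard `summary.before_filtering` / `summary.after_filtering` layout; python3 stands in for `jq` here):

```shell
# Parse fastp's JSON report properly instead of grepping it
for sample in barcode10 barcode11; do
  python3 - "analysis/qc/fastp/${sample}_fastp.json" <<'EOF'
import json, sys

with open(sys.argv[1]) as fh:
    report = json.load(fh)

# Assumed fastp layout: summary.{before,after}_filtering.total_reads
before = report["summary"]["before_filtering"]["total_reads"]
after = report["summary"]["after_filtering"]["total_reads"]
print(f"{sys.argv[1]}: {before} -> {after} reads ({100 * after / before:.1f}% retained)")
EOF
done
```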

Expected retention:


Step 7: FastQC on processed reads [Challenge]

Now let’s check the quality of our processed reads to verify improvement.

# Create output directory
mkdir -p analysis/qc/fastqc_filtered/

# Run FastQC on filtered reads
cat reads_paths.tab \
  | parallel -j 1 --colsep '\t' \
    'fastqc analysis/qc/fastp/{1}_filtered.fastq.gz \
       -o analysis/qc/fastqc_filtered -t 8'

Compare Raw vs Filtered Quality

# Open both reports side-by-side
firefox analysis/qc/fastqc_raw/barcode10_fastqc.html \
        analysis/qc/fastqc_filtered/barcode10_filtered_fastqc.html &

Step 8: Comprehensive MultiQC Report

Aggregate all QC reports (raw, Porechop, Fastp, filtered) into one comprehensive report.

# Create comprehensive MultiQC report
cd analysis/qc/
multiqc . -o multiqc_comprehensive/ --force

# Return to project root
cd ~/nanopore_training

This MultiQC report includes:

View Comprehensive Report

# Open comprehensive MultiQC report
firefox analysis/qc/multiqc_comprehensive/multiqc_report.html &

Key sections to review:

General Statistics table:

FastQC sections:

Fastp section:


Decision Tree: Proceed or Re-process?

Quality Check After Processing:
    ├─ ≥70% reads retained → ✓ Good, proceed to next step
    ├─ 50-70% retained → ⚠ Acceptable, note in report
    └─ <50% retained → ✗ Poor quality
        └─ Options:
            ├─ Relax filtering parameters
            ├─ Check raw data quality
            └─ Consider re-sequencing

Read Quality After Filtering:
    ├─ Mean Q ≥ 10 → ✓ Good
    ├─ Mean Q 7-10 → ⚠ Acceptable
    └─ Mean Q < 7 → ✗ Poor quality, reconsider parameters

Read Length After Filtering:
    ├─ Mean ≥ 2kb → ✓ Excellent for assembly
    ├─ Mean 1-2kb → ✓ Good for most analyses
    ├─ Mean 500-1kb → ⚠ OK for taxonomy, challenging for assembly
    └─ Mean < 500bp → ✗ Too short for bacterial genomics
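The read-retention branch of the tree can be captured in a small helper function (a sketch; the thresholds are taken from the tree above, and the function name is just illustrative):

```shell
# Encode the read-retention decision as a reusable function
retention_verdict() {
  local pct=$1   # whole-number percentage of reads retained
  if   [ "$pct" -ge 70 ]; then echo "GOOD: proceed to the next step"
  elif [ "$pct" -ge 50 ]; then echo "ACCEPTABLE: note the loss in your report"
  else echo "POOR: relax filtering, check raw data quality, or consider re-sequencing"
  fi
}

retention_verdict 82
retention_verdict 43
```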

Step 9: Taxonomic profiling with Kraken2 (BabyKraken database)

What is Kraken2?

Why start with BabyKraken?

Run Kraken2 with BabyKraken

# Create output directory
mkdir -p analysis/kraken2/babykraken/

# Run Kraken2 on filtered reads
cat reads_paths.tab \
  | parallel -j 1 --colsep '\t' \
    'kraken2 \
       --db databases/kraken2/babykraken \
       --threads 8 \
       --report analysis/kraken2/babykraken/{1}_report.txt \
       --output analysis/kraken2/babykraken/{1}_output.txt \
       --memory-mapping \
       analysis/qc/fastp/{1}_filtered.fastq.gz'

Command explanation:

Check Kraken2 output

# List generated files
ls -lh analysis/kraken2/babykraken/

# View classification summary
echo "=== Kraken2 BabyKraken Classification Summary ==="
for sample in barcode10 barcode11; do
  echo ""
  echo "Sample: $sample"
  head -20 analysis/kraken2/babykraken/${sample}_report.txt
done

Understanding the Kraken2 report:

Report columns:
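A Kraken2 report has six tab-separated columns: percentage of reads in the clade, reads in the clade, reads assigned directly to the taxon, rank code (U/R/D/K/P/C/O/F/G/S), NCBI taxonomy ID, and the indented scientific name. That structure makes the report easy to slice with awk; for example, the most abundant species (a sketch, using the report path from the command above):

```shell
# Top species rows: keep rank code "S" (column 4), strip padding,
# then sort by percentage (column 1) and show the top 5
awk -F'\t' '$4 == "S" { gsub(/ /, "", $1); gsub(/^ +/, "", $6); print $1 "%\t" $6 }' \
    analysis/kraken2/babykraken/barcode10_report.txt \
  | sort -t$'\t' -k1,1 -nr | head -5
```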

What to look for:


Step 10: Krona visualisation (BabyKraken Results)

What is Krona?

Generate Krona plots

# Create output directory
mkdir -p analysis/krona/

# Convert Kraken2 reports to Krona HTML
cat reads_paths.tab \
  | parallel -j 1 --colsep '\t' \
    'ktImportTaxonomy \
       -q 2 -t 3 \
       analysis/kraken2/babykraken/{1}_output.txt \
       -o analysis/krona/{1}_babykraken_krona.html'

Command explanation:

View Krona visualisations

# Open Krona HTML in browser
firefox analysis/krona/barcode10_babykraken_krona.html &
firefox analysis/krona/barcode11_babykraken_krona.html &

How to use Krona:

Interpretation tips:


Step 11: Taxonomic profiling with Kraken2 (standard database) [takes a lot of time - run in your free time]

Now let’s run the same samples with the more comprehensive standard Kraken2 database.

Standard database advantages:

Run Kraken2 with standard database

# Create output directory
mkdir -p analysis/kraken2/standard/

# Run Kraken2 on filtered reads
cat reads_paths.tab \
  | parallel -j 1 --colsep '\t' \
    'kraken2 \
       --db databases/kraken2/standard_16gb \
       --threads 8 \
       --report analysis/kraken2/standard/{1}_report.txt \
       --output analysis/kraken2/standard/{1}_output.txt \
       --memory-mapping \
       analysis/qc/fastp/{1}_filtered.fastq.gz'

Check standard database output

# View classification summary
echo "=== Kraken2 Standard Database Classification Summary ==="
for sample in barcode10 barcode11; do
  echo ""
  echo "Sample: $sample"
  head -20 analysis/kraken2/standard/${sample}_report.txt
done

Step 12: Krona visualisation and database comparison

Generate Krona plots for Standard database results

# Convert Kraken2 Standard reports to Krona HTML
cat reads_paths.tab \
    | parallel -j 1 --colsep '\t' \
        'ktImportTaxonomy \
             -q 2 -t 3 \
             analysis/kraken2/standard/{1}_output.txt \
             -o analysis/krona/{1}_standard_krona.html'

View and compare visualisations

# Open both Krona visualisations side by side
firefox analysis/krona/barcode10_babykraken_krona.html &
firefox analysis/krona/barcode10_standard_krona.html &

Systematic comparison of database results

# Create a comparison summary
echo "=== Database Comparison Summary ===" > analysis/kraken2/database_comparison.txt

for sample in barcode10 barcode11; do
    echo "" >> analysis/kraken2/database_comparison.txt
    echo "================================" >> analysis/kraken2/database_comparison.txt
    echo "Sample: $sample" >> analysis/kraken2/database_comparison.txt
    echo "================================" >> analysis/kraken2/database_comparison.txt
    
    echo "" >> analysis/kraken2/database_comparison.txt
    echo "--- BabyKraken ---" >> analysis/kraken2/database_comparison.txt
    head -15 analysis/kraken2/babykraken/${sample}_report.txt >> analysis/kraken2/database_comparison.txt
    
    echo "" >> analysis/kraken2/database_comparison.txt
    echo "--- Standard Database ---" >> analysis/kraken2/database_comparison.txt
    head -15 analysis/kraken2/standard/${sample}_report.txt >> analysis/kraken2/database_comparison.txt
done

# View comparison
cat analysis/kraken2/database_comparison.txt

Extract top species from both databases

# Extract species-level classifications
echo "=== Top Species Identified ===" 

for sample in barcode10 barcode11; do
    echo ""
    echo "Sample: $sample"
    echo ""
    echo "BabyKraken top species:"
    grep -P "\tS\t" analysis/kraken2/babykraken/${sample}_report.txt | sort -k1,1 -nr | head -5
    echo ""
    echo "Standard database top species:"
    grep -P "\tS\t" analysis/kraken2/standard/${sample}_report.txt | sort -k1,1 -nr | head -5
done
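One concrete number to compare is the unclassified fraction: the rank-U row of each report carries it in column 1. A sketch over the report paths used above:

```shell
# Compare the unclassified percentage (rank code "U") across databases
for db in babykraken standard; do
  for sample in barcode10 barcode11; do
    pct=$(awk -F'\t' '$4 == "U" { gsub(/ /, "", $1); print $1 }' \
            analysis/kraken2/$db/${sample}_report.txt)
    echo "$db / $sample: ${pct}% unclassified"
  done
done
```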

Expected differences between databases

Aspect              | BabyKraken                | Standard Database
--------------------|---------------------------|--------------------------------------
Database size       | ~10 MB                    | ~16 GB
Species coverage    | Limited, common organisms | Comprehensive, including rare species
Classification rate | Lower (more unclassified) | Higher (fewer unclassified)
Processing speed    | Faster                    | Slower
Memory usage        | Lower (~1-2 GB)           | Higher (~16 GB)
Best for            | Quick screening, testing  | Production analyses, publications

Additional Resources