# Nextflow Pipeline for ZipStrain
This page reflects the current zipstrain.nf workflow in this repository.
## What the Pipeline Supports

- Read mapping with Bowtie2 (`map_reads`)
- Profile generation from BAM files (`profile`)
- End-to-end SRA to profile (`from_sra_to_profile`)
- Pairwise genome comparison across profiles (`compare_genomes`)
- Pairwise gene comparison across profiles (`compare_genes`)
## Running Pattern

```bash
nextflow run zipstrain.nf \
  --mode <mode> \
  --input_table <path/to/input.csv> \
  --output_dir <path/to/output_dir> \
  -c conf.config \
  -profile <docker|alpine|gutbot|blanca> \
  -resume
```

`conf.config` already defines resources for the current process set and includes example execution profiles.
## Key Pipeline Parameters

- `--mode`: `map_reads`, `profile`, `from_sra_to_profile`, `compare_genomes`, `compare_genes`
- `--input_type`: depends on mode (`local`, `sra`, `profile_table`, `pair_table`)
- `--parallel_mode`: `single` or `batched` for comparison workflows
- `--batch_size`: number of pairs per batch when `--parallel_mode batched`
- `--batch_compare_n_parallel`: parallel jobs inside each batched comparison task
- `--compare_genome_scope`: genome scope for genome comparisons (`all` or a genome ID)
- `--compare_gene_scope`: gene scope for gene comparisons (`all:all`, `<genome>:all`, `all:<gene>`, `<genome>:<gene>`)
- `--compare_duckdb_memory_limit`: forwarded to single compare commands
- `--compare_calculate`: genome metrics for genome compare (`ani`, `ibs`, `identical_genes`, `all`, or `+` combinations). Default: `all`
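The `+` syntax for `--compare_calculate` can be illustrated with a small parser sketch. This is a hypothetical helper, not part of zipstrain; only the metric names (`ani`, `ibs`, `identical_genes`, `all`) come from the parameter list above:

```python
# Metric names accepted by --compare_calculate, per the list above.
VALID_METRICS = {"ani", "ibs", "identical_genes"}

def parse_compare_calculate(value):
    """Expand a '+'-combined metric string; 'all' selects every metric."""
    if value == "all":
        return set(VALID_METRICS)
    metrics = set(value.split("+"))
    unknown = metrics - VALID_METRICS
    if unknown:
        raise ValueError(f"unknown metrics: {sorted(unknown)}")
    return metrics
```

So `ani+ibs` selects two of the three metrics, while `all` is shorthand for `ani+ibs+identical_genes`.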
## 1) Map Reads (`mode=map_reads`)
### Input Table (`--input_type local`)

Paired-end:

```csv
sample_name,reads1,reads2
S1,/data/S1_R1.fastq.gz,/data/S1_R2.fastq.gz
S2,/data/S2_R1.fastq.gz,/data/S2_R2.fastq.gz
```

Single-end:

```csv
sample_name,reads1
S1,/data/S1.fastq.gz
S2,/data/S2.fastq.gz
```
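A paired-end table like the one above can be generated from a directory of reads. This sketch assumes an `_R1.fastq.gz`/`_R2.fastq.gz` naming convention, which may differ from your data; the helper itself is hypothetical, not part of zipstrain:

```python
import csv
from pathlib import Path

def build_read_table(fastq_dir, out_csv):
    """Pair *_R1/*_R2 fastq.gz files by shared prefix and write the input CSV."""
    rows = []
    for r1 in sorted(Path(fastq_dir).glob("*_R1.fastq.gz")):
        sample = r1.name[: -len("_R1.fastq.gz")]
        r2 = r1.with_name(f"{sample}_R2.fastq.gz")
        if r2.exists():  # skip samples missing the mate file
            rows.append((sample, str(r1), str(r2)))
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sample_name", "reads1", "reads2"])
        writer.writerows(rows)
    return rows
```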
### Input Table (`--input_type sra`)

```csv
Run
SRR12345678
SRR12345679
```
### A) Use an Existing Reference Genome

```bash
nextflow run zipstrain.nf \
  --mode map_reads \
  --input_type local \
  --input_table reads.csv \
  --reference_genome reference_genomes.fna \
  --stb reference_genomes.stb \
  --output_dir out_map \
  -c conf.config \
  -profile docker \
  -resume
```
Optional:

- `--index_files` to reuse existing Bowtie2 index files
- `--bowtie2_non_competitive_mapping true` to pass `-a` to Bowtie2
### B) Build a Reference from Sylph Automatically

If `--reference_genome` is not provided, the pipeline does:

- per-sample `sylph profile`
- a merge of all per-sample Sylph abundance tables
- `zipstrain utilities build-genome-db --tool sylph ...`
- `prodigal` gene prediction on the generated reference FASTA
- Bowtie2 indexing
- mapping
```bash
nextflow run zipstrain.nf \
  --mode map_reads \
  --input_type local \
  --input_table reads.csv \
  --output_dir out_map \
  --genome_db_cache_dir genome_cache \
  --sylph_db /path/to/custom.syldb \
  -c conf.config \
  -profile docker \
  -resume
```
If `--sylph_db` is omitted, `--sylph_db_link` is used for the download.
Map Outputs
- BAM files:
<output_dir>/*.bam - Sylph tables:
<output_dir>/sylph_abundance/ - Built reference bundle (when auto-built):
<output_dir>/db_from_sylph/
## 2) Generate Profiles from BAM (`mode=profile`)
### Input Table

```csv
sample_name,bamfile
S1,/data/S1.bam
S2,/data/S2.bam
```
### Command

```bash
nextflow run zipstrain.nf \
  --mode profile \
  --input_table bams.csv \
  --reference_genome reference_genomes.fna \
  --gene_file reference_genomes_gene.fasta \
  --stb reference_genomes.stb \
  --output_dir out_profile \
  -c conf.config \
  -profile docker \
  -resume
```
### Profile Outputs

- `<output_dir>/*_profile.parquet`
- `<output_dir>/*_genome_stats.parquet`
- `<output_dir>/*_gene_stats.parquet`
## 3) End-to-End SRA to Profile (`mode=from_sra_to_profile`)
### Input Table

```csv
Run
SRR12345678
SRR12345679
```
### Command

```bash
nextflow run zipstrain.nf \
  --mode from_sra_to_profile \
  --input_table sra.csv \
  --reference_genome reference_genomes.fna \
  --gene_file reference_genomes_gene.fasta \
  --stb reference_genomes.stb \
  --output_dir out_sra_profile \
  -c conf.config \
  -profile docker \
  -resume
```
### Outputs

- `<output_dir>/profiles/*_profile.parquet`
- `<output_dir>/profiles/*_genome_stats.parquet`
- `<output_dir>/profiles/*_gene_stats.parquet`
## 4) Compare Genomes (`mode=compare_genomes`)
### Input Option A: All-vs-All from a Profile List (`--input_type profile_table`)

```csv
sample_names,mpileup_files
S1,/profiles/S1_profile.parquet
S2,/profiles/S2_profile.parquet
S3,/profiles/S3_profile.parquet
```
### Input Option B: Explicit Pairs (`--input_type pair_table`)

```csv
sample_name_1,sample_name_2,profile_location_1,profile_location_2
S1,S2,/profiles/S1_profile.parquet,/profiles/S2_profile.parquet
S1,S3,/profiles/S1_profile.parquet,/profiles/S3_profile.parquet
```
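For Option B, a pair table covering every unordered pair of profiles can be produced with a short sketch. This is a hypothetical helper (when given a `profile_table`, the pipeline builds the all-vs-all pairs itself), shown only to illustrate the `pair_table` format above:

```python
import csv
from itertools import combinations

def build_pair_table(samples, out_csv):
    """samples: list of (sample_name, profile_path) tuples.

    Writes one row per unordered pair in the pair_table format and
    returns the number of pairs written.
    """
    pairs = list(combinations(samples, 2))
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sample_name_1", "sample_name_2",
                         "profile_location_1", "profile_location_2"])
        for (s1, p1), (s2, p2) in pairs:
            writer.writerow([s1, s2, p1, p2])
    return len(pairs)
```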
### Command

```bash
nextflow run zipstrain.nf \
  --mode compare_genomes \
  --input_type profile_table \
  --input_table profiles.csv \
  --stb reference_genomes.stb \
  --compare_genome_scope all \
  --compare_calculate ani+ibs+identical_genes \
  --parallel_mode batched \
  --batch_size 1000 \
  --batch_compare_n_parallel 4 \
  --compare_duckdb_memory_limit 4GB \
  --output_dir out_compare_genomes \
  -c conf.config \
  -profile docker \
  -resume
```
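With `profile_table` input the comparison is all-vs-all, so the pair count, and therefore the number of batched tasks, grows quadratically with the number of profiles. A back-of-the-envelope helper (hypothetical, not part of the pipeline) for sizing `--batch_size`:

```python
import math

def batched_task_count(n_profiles, batch_size):
    """Return (all-vs-all unordered pair count, resulting number of batch tasks)."""
    n_pairs = n_profiles * (n_profiles - 1) // 2
    return n_pairs, math.ceil(n_pairs / batch_size)
```

For example, 100 profiles yield 4,950 pairs, i.e. 5 tasks at `--batch_size 1000`.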
## 5) Compare Genes (`mode=compare_genes`)

Input-table formats are the same as for genome comparison (`profile_table` or `pair_table`).
### Command

```bash
nextflow run zipstrain.nf \
  --mode compare_genes \
  --input_type profile_table \
  --input_table profiles.csv \
  --stb reference_genomes.stb \
  --compare_gene_scope all:all \
  --parallel_mode batched \
  --batch_size 1000 \
  --batch_compare_n_parallel 4 \
  --compare_duckdb_memory_limit 4GB \
  --output_dir out_compare_genes \
  -c conf.config \
  -profile docker \
  -resume
```
## Comparison Outputs

- Final merged table: `<output_dir>/merged_comparisons.parquet`
- Intermediate batched outputs (when `parallel_mode=batched`): `<output_dir>/batch_comparisons/`
## Important Notes

- The old `--compare_memory_mode` and `--compare_chrom_batch_size` parameters are not part of the current `zipstrain.nf`.
- The pipeline currently forwards the DuckDB memory limit via `--compare_duckdb_memory_limit` but does not expose compare engine/threads as Nextflow params in this script.
- For auto-built references, genome selection comes from the merged Sylph abundance table through `zipstrain utilities build-genome-db`.