ZipStrain Command Line Interface
This page is the command reference. If you want end-to-end examples first, use the Tutorial, which includes:
- a standard workflow using the Python CLI
- a standard workflow using Nextflow
- a matrix workflow for repeated all-vs-all comparison
This page is organized by workflow area for easier navigation:
General usage:
zipstrain --help
For command-specific help:
zipstrain <command-or-group> --help
zipstrain <group> <command> --help
Profile
Profile Commands At A Glance
| Command | Purpose |
|---|---|
| `zipstrain profile` | Batch profiling for multiple BAM files |
| `zipstrain utilities prepare_profiling` | Build profiling assets (BED, gene ranges, genome lengths, null model, profiling contract) |
| `zipstrain utilities profile-single` | Profile one BAM file |
zipstrain profile
Run BAM profiling in batch mode.
zipstrain profile \
--input-table samples.csv \
--stb-file mapping.stb \
--null-model profiling_assets/null_model.parquet \
--profiling-contract profiling_assets/profiling_contract.json \
--bed-file profiling_assets/genomes_bed_file.bed \
--genome-length-file profiling_assets/genome_lengths.parquet \
--run-dir profile_run
Options:
- `-i, --input-table` (required)
- `-s, --stb-file` (required)
- `-u, --null-model` (required)
- `-g, --gene-range-table` (optional)
- `--profiling-contract` (optional)
- `-b, --bed-file` (required)
- `-l, --genome-length-file` (required)
- `-r, --run-dir` (required)
- `-n, --num-procs` (default: `8`)
- `-m, --max-concurrent-batches` (default: `5`)
- `-p, --poll-interval` (default: `1`)
- `-e, --execution-mode` (default: `local`)
- `-c, --slurm-config`
- `-o, --container-engine` (default: `local`)
- `--container-address` (optional): explicit image/address override for `docker` or `apptainer`; otherwise the current ZipStrain version tag is used
- `-t, --task-per-batch` (default: `10`)
- `--min-mapq` (default: `0`)
- `--min-baseq` (default: `13`)
- `--min-read-ani` (optional): filters reads before pileup using the BAM `NM` tag and aligned query span
- `--read-inclusion` (`proper-pairs|paired|all-mapped`, default: `all-mapped`)
Read-filter notes:
- `--min-mapq 0` and `--min-baseq 13` match current samtools mpileup defaults
- `--read-inclusion all-mapped` is the least restrictive mode and is the default
- `--min-read-ani` requires BAM alignments with `NM` tags
zipstrain utilities prepare_profiling
Prepare profiling database assets.
zipstrain utilities prepare_profiling \
--reference-fasta reference.fasta \
--stb-file mapping.stb \
--output-dir profiling_assets
Options:
- `-r, --reference-fasta` (required)
- `-g, --gene-fasta` (optional)
- `-s, --stb-file` (required)
- `-e, --error-rate` (default: `0.001`)
- `-m, --max-total-reads` (default: `10000`)
- `-p, --p-threshold` (default: `0.05`)
- `-t, --model-type` (default: `poisson`)
- `-o, --output-dir` (required)
Outputs:
- `genomes_bed_file.bed`
- `gene_range_table.tsv`
- `genome_lengths.parquet`
- `null_model.parquet`
- `profiling_contract.json`
zipstrain utilities profile-single
Profile a single BAM.
zipstrain utilities profile-single \
--bed-file genomes_bed_file.bed \
--bam-file sample.bam \
--stb-file mapping.stb \
--null-model profiling_assets/null_model.parquet \
--profiling-contract profiling_assets/profiling_contract.json \
--num-chunks 24 \
--max-concurrency 4 \
--output-dir sample_profile
Options:
- `-r, --reference-fasta` (optional): when provided, profiling also records `ref_base_bitmask` and adds `ref_ani` to gene/genome stat tables
- `-b, --bed-file` (required)
- `-a, --bam-file` (required)
- `-s, --stb-file` (required)
- `-m, --null-model` (required)
- `-g, --gene-range-table` (optional)
- `--profiling-contract` (optional)
- `-n, --num-chunks` (default: `24`): number of BED chunks to create
- `-c, --max-concurrency` (default: `4`): how many chunks run simultaneously
- `--min-mapq` (default: `0`)
- `--min-baseq` (default: `13`)
- `--min-read-ani` (optional): filters reads before pileup using the BAM `NM` tag and aligned query span
- `--read-inclusion` (`proper-pairs|paired|all-mapped`, default: `all-mapped`)
- `-o, --output-dir` (required)
Read-inclusion modes:
- `proper-pairs`: keep only mapped read pairs carrying the aligner `PROPER_PAIR` flag
- `paired`: keep mapped paired-end reads even if they are discordant
- `all-mapped`: keep any mapped read, whether paired or single-end
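The three modes map naturally onto the standard SAM FLAG bits. The sketch below is an illustration only, not ZipStrain's implementation: the helper name `keep_read` is hypothetical, and it assumes each mode is decided purely from the FLAG field (`0x1` paired, `0x2` proper pair, `0x4` unmapped).

```python
# SAM FLAG bits, as defined in the SAM specification.
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4


def keep_read(flag: int, mode: str = "all-mapped") -> bool:
    """Hypothetical helper: does a read pass the given read-inclusion mode?"""
    if flag & FLAG_UNMAPPED:
        return False  # every mode requires a mapped read
    if mode == "all-mapped":
        return True  # paired or single-end, as long as it is mapped
    if mode == "paired":
        return bool(flag & FLAG_PAIRED)  # discordant pairs are still kept
    if mode == "proper-pairs":
        return bool(flag & FLAG_PAIRED) and bool(flag & FLAG_PROPER_PAIR)
    raise ValueError(f"unknown read-inclusion mode: {mode}")
```

Read this way, each mode is a strict subset of the next: every read kept by `proper-pairs` is kept by `paired`, and every read kept by `paired` is kept by `all-mapped`.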
Outputs include:
- `<sample>_profile.parquet`
- `<sample>_genome_stats.parquet`
- `<sample>_gene_stats.parquet`
When --reference-fasta is provided during profiling, the profile parquet includes ref_base_bitmask.
In the same case, the generated genome and gene stat tables also include a ref_ani column.
ref_ani is the percentage of covered sites whose observed allele set still contains the reference allele after ZipStrain's sequence-error adjustment.
ref_base_bitmask uses this encoding:
- `1` = reference base `A`
- `2` = reference base `C`
- `4` = reference base `G`
- `8` = reference base `T`
- `0` = non-ACGT or unknown reference base
This is a one-hot bitmask, so current profiles are expected to contain only 0, 1, 2, 4, or 8 in this column.
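As a small illustration of the encoding, the following sketch decodes a `ref_base_bitmask` value back into a base. The function name `decode_ref_base` is hypothetical and is not part of ZipStrain; it only restates the table above.

```python
from typing import Optional

# One-hot encoding described above: a single bit per reference base.
REF_BASE_BY_BIT = {1: "A", 2: "C", 4: "G", 8: "T"}


def decode_ref_base(bitmask: int) -> Optional[str]:
    """Return the reference base, or None for 0 (non-ACGT or unknown)."""
    if bitmask == 0:
        return None
    if bitmask not in REF_BASE_BY_BIT:
        # Current profiles are expected to contain only 0, 1, 2, 4, or 8.
        raise ValueError(f"not a one-hot reference bitmask: {bitmask}")
    return REF_BASE_BY_BIT[bitmask]
```

Rejecting values such as `3` makes a corrupted or non-one-hot column fail loudly instead of silently mapping to a base.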
zipstrain utilities get-snp-reference
Emit profile-like rows that are SNPs relative to the reference from a classic profile parquet that includes ref_base_bitmask.
zipstrain utilities get-snp-reference \
--profile-file sample_profile.parquet \
--min-cov 5 \
--output-file sample_reference_snps.parquet
Options:
- `-p, --profile-file` (required)
- `-c, --min-cov` (default: `5`)
- `-o, --output-file` (required)
The output preserves the input profile-like columns and includes only positions that:
- have coverage `>= min_cov`
- have a known reference base in `ref_base_bitmask`
- do not retain the reference allele after profile sequence-error adjustment
This uses the same reference-sharing logic used to populate ref_ani in the gene and genome stat tables.
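The three conditions, and their link to `ref_ani`, can be sketched as follows. This is a rough illustration under stated assumptions, not ZipStrain's code: the `Site` record and the `allele_bitmask` field (the alleles that survive sequence-error adjustment, in the same bit encoding as `ref_base_bitmask`) are hypothetical, and treating "covered" as coverage `>= min_cov` is also an assumption.

```python
from typing import Iterable, NamedTuple


class Site(NamedTuple):
    coverage: int
    ref_base_bitmask: int  # 1/2/4/8 for A/C/G/T, 0 if unknown
    allele_bitmask: int    # hypothetical: alleles kept after error adjustment


def retains_reference(site: Site) -> bool:
    """True when the retained allele set still contains the reference base."""
    return bool(site.allele_bitmask & site.ref_base_bitmask)


def is_reference_snp(site: Site, min_cov: int = 5) -> bool:
    """Apply the three filters: coverage, known reference, reference lost."""
    return (
        site.coverage >= min_cov
        and site.ref_base_bitmask != 0
        and not retains_reference(site)
    )


def ref_ani(sites: Iterable[Site], min_cov: int = 5) -> float:
    """Percentage of covered, known-reference sites that retain the reference."""
    covered = [s for s in sites if s.coverage >= min_cov and s.ref_base_bitmask != 0]
    if not covered:
        return float("nan")
    return 100.0 * sum(retains_reference(s) for s in covered) / len(covered)
```

Under these assumptions, the sites reported as reference SNPs are exactly the covered sites that do not count towards `ref_ani`.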
Comparison
Comparison Commands At A Glance
| Command | Purpose |
|---|---|
| `zipstrain compare genomes` | Batch genome-level comparisons |
| `zipstrain compare genes` | Batch gene-level comparisons |
| `zipstrain utilities single_compare_genome` | Compare one pair at genome level |
| `zipstrain utilities chunk-genome-compare` | Compare many genome-level pairs in Python-side parallel batches |
| `zipstrain utilities single_compare_gene` | Compare one pair at gene level |
| `zipstrain utilities generate-genome-pairs` | Create all non-redundant standard-profile pairs |
| `zipstrain utilities build-profile-db` | Build a profile DB parquet from `profile_name,profile_location` |
| `zipstrain utilities to-complete-table` | Emit not-yet-completed pair table |
zipstrain compare genomes
zipstrain compare genomes \
--profile-db profile_db.parquet \
--scope all \
--stb-file mapping.stb \
--run-dir compare_run \
--ani-method popani \
--engine duckdb \
--calculate all
Options:
- `--profile-db` (required)
- `--comp-db-file` (optional): current genome comparison parquet
- `--scope` (default: `all`)
- `--min-cov` (default: `5`)
- `--min-gene-compare-len` (default: `100`)
- `--stb-file` (optional)
- `-r, --run-dir` (required)
- `-m, --max-concurrent-batches` (default: `5`)
- `-p, --poll-interval` (default: `1`)
- `-e, --execution-mode` (default: `local`)
- `-s, --slurm-config`
- `-c, --container-engine` (default: `local`)
- `--container-address` (optional): explicit image/address override for `docker` or `apptainer`; otherwise the current ZipStrain version tag is used
- `-t, --task-per-batch` (default: `10`)
- `-a, --ani-method` (default: `popani`): ANI method (`popani`, `conani`, `cosani_<threshold>`)
- `--engine` (`polars|duckdb`, default: `polars`)
- `--calculate` (`ani`, `ibs`, `identical_genes`, `all`, or `+` combinations like `ani+ibs`; default: `all`)
- `-d, --duckdb-memory-limit`
- `--duckdb-threads`
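To make the `--calculate` syntax concrete, here is a minimal parsing sketch. It is not ZipStrain's parser: the function name `parse_calculate` is hypothetical, and it assumes that `all` expands to the three named metrics and cannot be combined with others.

```python
from typing import List

# Metric names listed for `zipstrain compare genomes --calculate`.
VALID_METRICS = ("ani", "ibs", "identical_genes")


def parse_calculate(value: str) -> List[str]:
    """Expand a value such as 'ani+ibs' or 'all' into a list of metric names."""
    parts = [part.strip() for part in value.split("+") if part.strip()]
    if not parts:
        raise ValueError("--calculate needs at least one metric")
    if "all" in parts:
        if len(parts) > 1:
            raise ValueError("'all' cannot be combined with other metrics")
        return list(VALID_METRICS)  # assumption: all = every named metric
    unknown = [part for part in parts if part not in VALID_METRICS]
    if unknown:
        raise ValueError(f"unknown metric(s): {', '.join(unknown)}")
    # Return in a canonical order and drop duplicates.
    return [metric for metric in VALID_METRICS if metric in parts]
```

For example, `parse_calculate("ibs+ani")` and `parse_calculate("ani+ibs")` both yield the same canonical list.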
zipstrain compare genes
zipstrain compare genes \
--profile-db profile_db.parquet \
--scope all:all \
--stb-file mapping.stb \
--run-dir gene_compare_run
Options:
- `--profile-db` (required)
- `--comp-db-file` (optional): current gene comparison parquet
- `--scope` (default: `all:all`)
- `--min-cov` (default: `5`)
- `--min-gene-compare-len` (default: `100`)
- `--stb-file` (optional)
- `-r, --run-dir` (required)
- `-m, --max-concurrent-batches` (default: `5`)
- `-p, --poll-interval` (default: `1`)
- `-e, --execution-mode` (default: `local`)
- `-s, --slurm-config`
- `-c, --container-engine` (default: `local`)
- `--container-address` (optional): explicit image/address override for `docker` or `apptainer`; otherwise the current ZipStrain version tag is used
- `-t, --task-per-batch` (default: `10`)
- `-n, --ani-method` (default: `popani`)
- `--engine` (`polars|duckdb`, default: `polars`)
- `-d, --duckdb-memory-limit`
- `--duckdb-threads`
zipstrain utilities single_compare_genome
zipstrain utilities single_compare_genome \
--mpileup-contig-1 sample_a.parquet \
--mpileup-contig-2 sample_b.parquet \
--stb-file mapping.stb \
--output-file out.parquet
Options:
- `-m1, --mpileup-contig-1` (required)
- `-m2, --mpileup-contig-2` (required)
- `-s, --stb-file` (required)
- `-c, --min-cov` (default: `5`)
- `-l, --min-gene-compare-len` (default: `100`)
- `-o, --output-file` (required)
- `-g, --genome` (default: `all`)
- `-a, --ani-method` (default: `popani`)
- `--calculate` (default: `all`)
- `--engine` (`polars|duckdb`, default: `polars`)
- `--duckdb-memory-limit`
- `--duckdb-temp-directory`
- `--duckdb-threads`
zipstrain utilities generate-genome-pairs
zipstrain utilities generate-genome-pairs \
--profile-dir profiles \
--output-file genome_pairs.parquet
This writes a parquet table with:
- `sample_name_1`
- `sample_name_2`
- `profile_location_1`
- `profile_location_2`
Options:
- `-p, --profile-dir` (required)
- `-o, --output-file` (required)
- `--write-batch-size` (default: `100000`)
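"Non-redundant" pairs can be read as unordered pairs with no self-comparisons, so `n` profiles give `n * (n - 1) / 2` rows. The sketch below illustrates that idea with the four output columns; the function name `genome_pairs` and the name-to-location mapping it takes are hypothetical, and this is not how ZipStrain itself enumerates pairs.

```python
from itertools import combinations
from typing import Dict, List


def genome_pairs(profiles: Dict[str, str]) -> List[Dict[str, str]]:
    """Build one row per unordered sample pair.

    `profiles` maps a sample name to the location of its profile parquet.
    """
    rows = []
    # combinations() yields each unordered pair exactly once, never (a, a).
    for name_1, name_2 in combinations(sorted(profiles), 2):
        rows.append(
            {
                "sample_name_1": name_1,
                "sample_name_2": name_2,
                "profile_location_1": profiles[name_1],
                "profile_location_2": profiles[name_2],
            }
        )
    return rows
```

The pair count grows quadratically with the number of profiles, which is presumably why the command writes its output in batches (`--write-batch-size`).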
zipstrain utilities chunk-genome-compare
zipstrain utilities chunk-genome-compare \
--pair-table genome_pairs.parquet \
--stb-file mapping.stb \
--output-file chunk_compare.parquet \
--workers 8 \
--engine polars
This command runs standard genome comparisons directly inside Python for one pair-table chunk. It is intended as an experimental utility for benchmarking or ad hoc compare runs, and does not change the main workflow commands.
Accepted pair-table schemas:
- `sample_name_1,sample_name_2,profile_location_1,profile_location_2`
- `sample_name_1,sample_name_2,profile_1,profile_2`
- `sample_1,sample_2,profile_1,profile_2`
- `profile_location_1,profile_location_2`
- `profile_1,profile_2`
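If you generate pair tables yourself, a quick pre-flight check against the accepted schemas can catch column-name mistakes early. The sketch below is a hypothetical helper, not part of ZipStrain, and it assumes that extra columns are tolerated and that the most specific schema should win.

```python
from typing import Iterable, Optional, Tuple

# Accepted column sets, listed from most to least specific.
ACCEPTED_SCHEMAS = [
    ("sample_name_1", "sample_name_2", "profile_location_1", "profile_location_2"),
    ("sample_name_1", "sample_name_2", "profile_1", "profile_2"),
    ("sample_1", "sample_2", "profile_1", "profile_2"),
    ("profile_location_1", "profile_location_2"),
    ("profile_1", "profile_2"),
]


def match_schema(columns: Iterable[str]) -> Optional[Tuple[str, ...]]:
    """Return the first accepted schema fully contained in `columns`."""
    available = set(columns)
    for schema in ACCEPTED_SCHEMAS:
        if available.issuperset(schema):
            return schema
    return None  # no accepted schema matches
```

Because the four-column schemas are checked first, a table that has both sample and profile columns is matched to a named-sample schema rather than to a location-only one.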
Options:
- `-p, --pair-table` (required)
- `-s, --stb-file` (required)
- `-o, --output-file` (required)
- `-w, --workers` (defaults to CPU count capped by pair count)
- `-c, --min-cov` (default: `5`)
- `-l, --min-gene-compare-len` (default: `100`)
- `-g, --genome` (default: `all`)
- `-a, --ani-method` (default: `popani`)
- `--calculate` (default: `all`)
- `--engine` (`polars|duckdb`, default: `polars`)
- `--duckdb-memory-limit`
- `--duckdb-temp-directory`
- `--duckdb-threads`
The final console summary includes:
- total pairs processed
- total genome-level output rows written
- total elapsed time
- average wall time per pair
- average compute time per pair
- average time per genome-level output row
zipstrain utilities single_compare_gene
zipstrain utilities single_compare_gene \
--mpileup-contig-1 sample_a.parquet \
--mpileup-contig-2 sample_b.parquet \
--stb-file mapping.stb \
--scope all:all \
--output-file out.parquet
Options:
- `-m1, --mpileup-contig-1` (required)
- `-m2, --mpileup-contig-2` (required)
- `-s, --stb-file` (required)
- `-c, --min-cov` (default: `5`)
- `-l, --min-gene-compare-len` (default: `100`)
- `-o, --output-file` (required)
- `-g, --scope` (default: `all:all`)
- `-a, --ani-method` (default: `popani`)
- `--engine` (`polars|duckdb`, default: `polars`)
- `--duckdb-memory-limit`
- `--duckdb-temp-directory`
- `--duckdb-threads`
zipstrain utilities to-complete-table
zipstrain utilities to-complete-table \
--profile-db profile_db.parquet \
--comp-db-file current_compare.parquet \
--output-file remaining_pairs.csv
Options:
- `--profile-db` (required)
- `--comp-db-file` (optional)
- `-o, --output-file` (required)
Output columns:
- `sample_name_1`
- `sample_name_2`
- `profile_location_1`
- `profile_location_2`
Notes:
- this command does not need `--scope`, `--min-cov`, `--min-gene-compare-len`, or `--stb-file`
- it only compares the sample-pair universe implied by the profile DB against the pairs already present in the current genome comparison parquet
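Conceptually, the remaining-pair table is a set difference: all unordered sample pairs implied by the profile DB, minus the pairs already compared. A minimal sketch, assuming pairs are order-insensitive (the function name `remaining_pairs` is hypothetical and this is not ZipStrain's implementation):

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def remaining_pairs(
    profile_names: Iterable[str],
    completed: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """List the sample pairs that are not yet in the comparison table."""
    # frozenset makes (a, b) and (b, a) count as the same completed pair.
    done = {frozenset(pair) for pair in completed}
    return [
        (a, b)
        for a, b in combinations(sorted(set(profile_names)), 2)
        if frozenset((a, b)) not in done
    ]
```

With an empty `completed` list, this degenerates to the full all-vs-all pair set, which matches the case where `--comp-db-file` is omitted.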
Utilities
Utility Commands At A Glance
| Command | Purpose |
|---|---|
| `zipstrain utilities build-null-model` | Build sequencing-error null model |
| `zipstrain utilities merge_parquet` | Merge parquet files |
| `zipstrain utilities merge-stat-tables` | Merge gene/genome stat parquet files with sample labels |
| `zipstrain utilities get-coverage-stats` | Rebuild coverage-only gene/genome stats from a profile parquet |
| `zipstrain utilities process_mpileup` | Convert mpileup stream to parquet |
| `zipstrain utilities make_bed` | Build bed chunks from fasta |
| `zipstrain utilities get_genome_lengths` | Genome lengths from STB + BED |
| `zipstrain utilities generate-genome-pairs` | Create all non-redundant standard-profile pairs |
| `zipstrain utilities chunk-genome-compare` | Compare many genome-level pairs in Python-side parallel batches |
| `zipstrain utilities strain_heterogeneity` | Strain heterogeneity metrics |
| `zipstrain utilities build-profile-db` | Build profile DB parquet |
| `zipstrain utilities build-matrix-db` | Build the current per-sample genome matrix store directly from profile parquets |
| `zipstrain utilities append-matrix-db` | Append new profiles into an existing matrix store |
| `zipstrain utilities matrix-db-to-hdf5` | Convert a DuckDB matrix database into the current matrix-store format |
| `zipstrain utilities matrix-compare` | Resumable all-vs-all matrix compare into a DuckDB compare DB |
| `zipstrain utilities matrix-compare-export` | Export a matrix compare DuckDB to parquet |
| `zipstrain utilities build-genome-db` | Build local genome reference bundle from abundance table |
| `zipstrain utilities presence-profile` | Presence profile from coverage + read locations |
| `zipstrain utilities process-read-locs` | Process read-location stream |
| `zipstrain utilities generate_stb` | Create scaffold-to-genome map from genome files |
| `zipstrain utilities gene-range-table` | Create gene range table |
| `zipstrain test` | Validate local installation/dependencies |
zipstrain utilities build-genome-db
zipstrain utilities build-genome-db \
--tool sylph \
--abundance-table sylph_abundance.tsv \
--cache-dir genome_cache \
--output-dir .
Important options:
- `--download-retries` (default: `8`)
- `--retry-backoff-seconds` (default: `10.0`)
- `--download-workers` (default: `1`)
zipstrain utilities build-matrix-db
zipstrain utilities build-matrix-db \
--profile-dir profiles \
--output-file matrix_db.h5 \
--bed-file genomes_bed_file.bed \
--stb-file reference.stb \
--gene-range-table gene_range_table.tsv \
--memory-limit-gb 16
What it does:
- scans a directory of standard ZipStrain profile parquets
- builds one matrix store directly from those profiles
- uses the BED and STB files as the explicit scaffold/genome contract for the store
- stores each genome as one sample-major dense dataset with shape `samples x positions x 4`
- positions with total coverage below `5` are zeroed during matrix build
- can optionally store scaffold-relative gene ranges for later gene ANI
- is intended for repeated cohort-scale comparison runs against the same reference set
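To give a feel for the dense layout and the low-coverage zeroing, here is a plain-Python sketch. It rests on assumptions that the page does not state: that the last axis holds four per-base counts (for example A, C, G, T), that "total coverage" is their sum, and that the dense footprint is simply `samples x positions x 4` counts at the chosen dtype width. The function names are hypothetical.

```python
from typing import List

MIN_TOTAL_COVERAGE = 5  # threshold quoted in the list above
BYTES_PER_COUNT = {"uint16": 2, "uint32": 4}


def zero_low_coverage(sample_matrix: List[List[int]]) -> List[List[int]]:
    """Zero every position whose four counts sum to less than the threshold.

    `sample_matrix` is one sample's `positions x 4` slice of the dataset.
    """
    return [
        list(counts) if sum(counts) >= MIN_TOTAL_COVERAGE else [0, 0, 0, 0]
        for counts in sample_matrix
    ]


def dense_genome_bytes(n_samples: int, n_positions: int, count_dtype: str = "uint16") -> int:
    """Uncompressed size of one genome's samples x positions x 4 dataset."""
    return n_samples * n_positions * 4 * BYTES_PER_COUNT[count_dtype]
```

As a rough guide under these assumptions, 100 samples of a 1 Mbp genome at `uint16` come to about 800 MB before any compression or sparse storage, and `uint32` doubles that.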
Important options:
- `-p, --profile-dir` (required)
- `-o, --output-file` (required)
- `-g, --genome`: optional genome scope (default: `all`)
- `-b, --bed-file` (required): BED file defining scaffold coordinate extents for the matrix contract
- `-s, --stb-file` (required): STB file defining scaffold-to-genome membership for the matrix contract
- `--gene-range-table`: optional headerless TSV of `gene, scaffold, start, end` for gene ANI support
- `--count-dtype`: stored matrix dtype (`uint16|uint32`, default: `uint16`)
- `--memory-limit-gb`: approximate maximum memory budget for the entire build process (default: `16.0`)
- `--export-batch-mb`: approximate matrix-store sample-axis chunk target size in MiB (default: `128.0`)
- `--sparse`: store genome matrices sparsely in HDF5
Notes:
- the output matrix store is intended for `zipstrain utilities matrix-compare`
- new matrix stores are append-friendly on the sample axis
- every input profile is interpreted against the BED+STB contract you provide here
- install matrix support with `pip install "zipstrain[matrix]"`
- the CLI shows a progress bar in an interactive terminal
- in non-interactive runs, the CLI emits throttled structured progress lines to stderr for log files
- if `--gene-range-table` is omitted, matrix compare can still compute genome ANI and IBS, but not gene ANI
- `--sparse` reduces on-disk HDF5 size, but matrix compare currently materializes sparse storage back into dense arrays when loading for comparison
zipstrain utilities append-matrix-db
zipstrain utilities append-matrix-db \
--profile-dir new_profiles \
--matrix-db-file matrix_db.h5 \
--memory-limit-gb 16
What it does:
- scans a directory of new standard ZipStrain profile parquets
- validates that they match the existing matrix-store contract
- appends new sample rows and whole-genome matrices into the existing matrix store
- materializes newly encountered genomes when they are still compatible with the stored BED+STB contract
- ignores genomes that fall outside the stored contract and reports the ignored count
Important options:
- `-p, --profile-dir` (required)
- `-m, --matrix-db-file` (required)
- `--memory-limit-gb`: approximate maximum memory budget for the append process (default: `16.0`)
- `--export-batch-mb`: approximate matrix-store sample-axis chunk target size in MiB used when rewriting an older fixed-size store (default: `128.0`)
Append behavior:
- sample names must be new
- known scaffolds and coordinate ranges must stay within the stored contract
- compatible genomes can be appended even if no matrix dataset existed for them yet
- genomes outside the stored contract are skipped and counted in the summary output
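These rules can be pictured as a small planning step that runs before any data is written. The sketch below is illustrative only: `plan_append` is a hypothetical name, genome membership stands in for the full BED+STB contract check (it does not model scaffold coordinate ranges), and counting distinct ignored genomes is an assumption about how the summary is reported.

```python
from typing import Dict, Iterable, List, Tuple


def plan_append(
    existing_samples: Iterable[str],
    contract_genomes: Iterable[str],
    new_profiles: Dict[str, List[str]],
) -> Tuple[Dict[str, List[str]], int]:
    """Decide which genomes to append for each new sample.

    `new_profiles` maps a new sample name to the genomes found in its profile.
    Returns the per-sample plan and the number of distinct ignored genomes.
    """
    duplicates = sorted(set(new_profiles) & set(existing_samples))
    if duplicates:
        # Sample names must be new.
        raise ValueError(f"samples already in the store: {', '.join(duplicates)}")
    contract = set(contract_genomes)
    plan: Dict[str, List[str]] = {}
    ignored = set()
    for sample, genomes in new_profiles.items():
        plan[sample] = [g for g in genomes if g in contract]
        ignored.update(g for g in genomes if g not in contract)
    return plan, len(ignored)
```

Failing on duplicate sample names before touching the store keeps a rejected append from leaving partial data behind.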
zipstrain utilities matrix-db-to-hdf5
zipstrain utilities matrix-db-to-hdf5 \
--matrix-db-file matrix_db.duckdb \
--output-file matrix_db.h5
What it does:
- converts an existing DuckDB matrix database into the current matrix-store layout
- preserves sample, genome, and scaffold metadata
- is only needed when you already have a DuckDB-based matrix database from an older workflow
Important options:
- `-m, --matrix-db-file` (required)
- `-o, --output-file`: optional output HDF5 path; defaults to the same basename with `.h5`
- `--export-batch-mb`: approximate matrix-store sample-axis chunk target size in MiB (default: `128.0`)
zipstrain utilities matrix-compare
zipstrain utilities matrix-compare \
--matrix-db-file matrix_db.h5 \
--output-file matrix_compare.duckdb \
--memory-limit-gb 16 \
--anchor-queue-size 1 \
--target-queue-size 1 \
--result-transfer-batch-size 1 \
--loader-executor thread \
--writer-executor thread \
--calculate ani+ibs \
--backend numpy
What it does:
- reads a per-sample genome-matrix store from `build-matrix-db` or `matrix-db-to-hdf5`
- writes results into a DuckDB compare database
- if the compare DB already exists, only pairs not yet marked completed are processed
- loads one anchor sample plus as many target samples as fit the memory budget
- computes genome ANI from dense whole-genome matrices
- computes IBS from the shared-allele boolean mask
- computes gene ANI when gene annotations are present in the matrix store and `gene` is requested
- stores result rows and completion metadata in the compare DB incrementally
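The anchor-plus-targets loading pattern and the resume behavior can be sketched as a simple planner. This is a hedged illustration, not ZipStrain's scheduler: both function names are hypothetical, the memory budget is treated as GiB, and every sample is assumed to need the same number of bytes.

```python
from typing import FrozenSet, Iterator, List, Sequence, Tuple


def targets_per_block(memory_limit_gb: float, bytes_per_sample: int) -> int:
    """How many target samples fit next to one anchor within the budget."""
    budget = int(memory_limit_gb * 1024 ** 3)
    remaining = budget - bytes_per_sample  # reserve room for the anchor
    return max(remaining // bytes_per_sample, 0)


def plan_blocks(
    samples: Sequence[str],
    block_size: int,
    completed: FrozenSet[FrozenSet[str]] = frozenset(),
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (anchor, targets) units, skipping pairs already marked completed."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    for index, anchor in enumerate(samples):
        # Only later samples are targets, so each pair is visited once.
        pending = [
            target
            for target in samples[index + 1:]
            if frozenset((anchor, target)) not in completed
        ]
        for start in range(0, len(pending), block_size):
            yield anchor, pending[start:start + block_size]
```

Passing the set of completed pairs back in on a rerun yields only the unfinished work, which is the essence of a resumable compare.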
Important options:
- `-m, --matrix-db-file` (required)
- `-o, --output-file` (required)
- `-g, --genome`: optional genome scope (default: `all`)
- `--memory-limit-gb`: approximate compare memory budget
- `--anchor-queue-size`: number of torch anchor matrices to keep queued in host RAM while still transferring only one anchor at a time to the GPU (default: `1`)
- `--target-queue-size`: number of torch target blocks to keep queued in host RAM; `1` preserves the current synchronous target-load behavior (default: `1`)
- `--result-transfer-batch-size`: number of torch compare units to batch before transferring result vectors back to CPU (default: `1`)
- `--loader-executor`: executor kind for torch loader prefetch work (`thread|process`, default: `thread`)
- `--writer-executor`: executor kind for torch result writing/checkpoint work (`thread|process`, default: `thread`)
- `--calculate`: matrix metrics to compute:
  - `ani`
  - `ani+ibs`
  - `+gene` or `gene` for gene ANI
  - `all` means `ani+ibs`, and also `gene` when the matrix store contains gene annotations
- `--backend`: compute backend:
  - `numpy`
  - `torch`
  - `torch-cpu`
  - `torch-cuda`
  - `torch-mps`
Notes:
- install Torch support with `pip install "zipstrain[matrix]"`
- `--backend numpy` works without Torch and is the simplest CPU-only path
- on Apple Silicon, the standard `torch` wheel can use MPS
- on Linux with NVIDIA GPUs, replace Torch with the CUDA wheel that matches your system, for example:
  pip install "zipstrain[matrix]"
  pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu124
- MPS requires native macOS; Linux containers cannot expose Apple Metal
- `torch` auto-selects CUDA, then MPS, then CPU
- for torch backends, GPU work stays on the main process while loader and writer stages can run through either thread or process executors
- the compare database is resumable; rerunning the same command on the same output file only processes unfinished sample pairs
- the CLI shows a progress bar in an interactive terminal
- in non-interactive runs, the CLI emits throttled structured progress lines to stderr for log files
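The CUDA, then MPS, then CPU preference described for `--backend torch` can be checked on your own machine with a few lines of Python. This sketch is not ZipStrain's selection code; `pick_device` and `detect_device` are hypothetical names, and falling back to `"cpu"` when Torch is not installed is a choice made for this example only.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Apply the stated preference order: CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe the local Torch install, if any, and pick a device."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption for this sketch: no Torch means CPU only
    # torch.backends.mps is absent on older Torch builds, so probe defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_available = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_available)
```

Running `detect_device()` before a long compare run is a cheap way to confirm that the wheel you installed actually exposes the accelerator you expect.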
zipstrain utilities matrix-compare-export
zipstrain utilities matrix-compare-export \
--matrix-compare-db-file matrix_compare.duckdb \
--output-file matrix_compare.parquet
Export the gene table instead:
zipstrain utilities matrix-compare-export \
--matrix-compare-db-file matrix_compare.duckdb \
--output-file matrix_compare_gene.parquet \
--table gene
What it does:
- reads `matrix_compare_results` from a matrix compare DuckDB
- exports the standard compare columns to parquet
- uses the stored compare metadata to choose the correct output columns
- can also export `matrix_compare_gene_results` with `--table gene`
- raises an error if `--table gene` is requested but the compare DB has no gene table
Important options:
- `-m, --matrix-compare-db-file` (required)
- `-o, --output-file` (required)
- `--table genome|gene` (optional, default: `genome`)
zipstrain utilities merge_parquet
zipstrain utilities merge_parquet \
--input-dir comps \
--output-file merged_comparisons.parquet \
--batch-size 5000
Notes:
- `--batch-size -1` keeps the current single-pass merge behavior.
- Positive `--batch-size` values first merge input files batch-by-batch into a temporary directory, then do one final lazy merge over those batch outputs.
- Progress is logged as active line-oriented batch updates and flushed immediately, which is easier to follow in cluster logs.
- When input files contain ZipStrain compare metadata, those metadata fields must match across inputs unless `--allow-mismatch` is used.
- With `--allow-mismatch`, mismatched compare metadata are rewritten to `NA` in the merged parquet metadata.
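The batching and metadata rules can be illustrated without touching any parquet files. The sketch below is an assumption-laden outline, not the actual merge code: `plan_batches` and `merge_metadata` are hypothetical helpers, and the metadata keys in the usage example are made up.

```python
from typing import Dict, List, Sequence


def plan_batches(files: Sequence[str], batch_size: int) -> List[List[str]]:
    """Group input files: -1 means one single pass, N > 0 means batches of N."""
    files = list(files)
    if batch_size == -1:
        return [files]
    if batch_size <= 0:
        raise ValueError("batch size must be -1 or a positive integer")
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]


def merge_metadata(
    metadata_list: Sequence[Dict[str, str]],
    allow_mismatch: bool = False,
) -> Dict[str, str]:
    """Combine per-file metadata, replacing disagreeing fields with 'NA'."""
    merged: Dict[str, str] = {}
    for key in sorted({key for metadata in metadata_list for key in metadata}):
        values = {metadata.get(key) for metadata in metadata_list}
        if len(values) == 1:
            merged[key] = values.pop()
        elif allow_mismatch:
            merged[key] = "NA"
        else:
            raise ValueError(f"metadata mismatch for {key!r}")
    return merged
```

For instance, merging files produced with different (hypothetical) `ani_method` values raises an error by default and yields `ani_method = "NA"` when mismatches are allowed.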
zipstrain utilities build-profile-db
zipstrain utilities build-profile-db \
--profile-db-csv profiles.csv \
--output-file profile_db.parquet
Input CSV columns:
- `profile_name`
- `profile_location`
Notes:
- By default, ZipStrain checks that all listed profiles carry matching embedded contract metadata.
- Use `--allow-mismatch` to skip that validation and build a mixed profile DB intentionally.
- The output parquet stores at least `profile_name` and `profile_location`, plus shared metadata fields derived from the listed profiles.
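If your profiles sit in one directory, the input CSV can be generated with a short script. This is a convenience sketch under assumptions: the helper name `write_profile_db_csv` is hypothetical, files are assumed to follow the `<sample>_profile.parquet` naming used by `profile-single`, and absolute paths are assumed to be acceptable as `profile_location`.

```python
import csv
from pathlib import Path


def write_profile_db_csv(profile_dir, csv_path, suffix="_profile.parquet"):
    """Write a profile_name,profile_location CSV and return the row count."""
    rows = [
        # Sample name = file name with the assumed suffix stripped.
        (path.name[: -len(suffix)], str(path.resolve()))
        for path in sorted(Path(profile_dir).glob(f"*{suffix}"))
    ]
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["profile_name", "profile_location"])
        writer.writerows(rows)
    return len(rows)
```

The resulting file can then be passed to `zipstrain utilities build-profile-db --profile-db-csv`.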
zipstrain utilities get-coverage-stats
zipstrain utilities get-coverage-stats \
--profile-parquet sample_profile.parquet \
--gene-bed reference_genomes_gene_ranges.tsv \
--genome-bed genomes_bed_file.bed \
--output-dir stats \
--prefix sample1
What it does:
- rebuilds coverage-only genome and gene stats from an existing profile parquet
- writes:
  - `<output-dir>/<prefix>_gene_stats.parquet`
  - `<output-dir>/<prefix>_genome_stats.parquet`
- does not require read-location files
- uses the profile’s existing `gene` and `genome` columns for counts
- uses the supplied gene/genome BED files only to calculate lengths
Output columns:
- gene stats: `genome`, `gene`, `length`, `breadth`, `coverage`, `5x_cov_sites`, `ber`
- genome stats: `genome`, `length`, `breadth`, `coverage`, `5x_cov_sites`, `ber`
Important options:
- `-p, --profile-parquet` (required)
- `-g, --gene-bed` (required): supports 4 columns `gene, scaffold, start, end` or 5 columns with genome appended
- `-b, --genome-bed` (required): supports 3 columns `scaffold, start, end` or 4 columns with genome appended
- `-o, --output-dir` (required)
- `--prefix` (required)
Other Utility Commands
Use --help on each command for full option details:
zipstrain utilities build-null-model --help
zipstrain utilities merge_parquet --help
zipstrain utilities get-coverage-stats --help
zipstrain utilities process_mpileup --help
zipstrain utilities make_bed --help
zipstrain utilities get_genome_lengths --help
zipstrain utilities generate-genome-pairs --help
zipstrain utilities chunk-genome-compare --help
zipstrain utilities merge-stat-tables --help
zipstrain utilities strain_heterogeneity --help
zipstrain utilities build-profile-db --help
zipstrain utilities build-matrix-db --help
zipstrain utilities matrix-compare --help
zipstrain utilities presence-profile --help
zipstrain utilities process-read-locs --help
zipstrain utilities generate_stb --help
zipstrain utilities gene-range-table --help
zipstrain test --help