Build Reference FASTA/STB from Sylph Abundances
This guide shows how to build a reference bundle directly from a Sylph abundance table using:
zipstrain utilities build-genome-db
What this does
Given a Sylph abundance table, ZipStrain will:
- Keep genomes with non-zero abundance in at least one sample.
- Resolve/download those genomes into a persistent cache directory.
- Reuse genomes already present in that cache (no redownload).
- Write a concatenated reference FASTA and STB file in your output directory.
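The first step above (keeping genomes with non-zero abundance in at least one sample) can be sketched as follows. This is a minimal illustration, not ZipStrain's implementation; the miniature table and its column names are assumptions.

```python
import csv
import io

# Hypothetical miniature Sylph-style abundance table: one row per genome,
# one abundance column per sample (sample column names are made up here).
table = """Genome_file,sampleA,sampleB
GCA_000001.1_genomic.fna.gz,0.0,12.5
GCA_000002.1_genomic.fna.gz,0.0,0.0
GCF_000003.1_genomic.fna.gz,3.1,0.0
"""

def select_nonzero(rows):
    """Keep genomes with non-zero abundance in at least one sample."""
    kept = []
    for row in rows:
        genome = row.pop("Genome_file")
        if any(float(v) > 0 for v in row.values()):
            kept.append(genome)
    return kept

selected = select_nonzero(csv.DictReader(io.StringIO(table)))
print(selected)  # the all-zero genome (GCA_000002.1) is dropped
```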
Command
zipstrain utilities build-genome-db \
--tool sylph \
--abundance-table /path/to/sylph_abundance.csv \
--cache-dir /path/to/genome_cache \
--output-dir /path/to/reference_bundle \
--download-retries 3 \
--retry-backoff-seconds 1.0 \
--download-workers 4
Inputs
- `--tool`: abundance parser to use (currently `sylph`).
- `--abundance-table`: `.csv`, `.tsv`, or `.parquet` table.
- `--cache-dir`: persistent genome cache directory.
- `--output-dir`: output directory for the final reference bundle.
For Sylph tables, ZipStrain extracts genome accessions from the `Genome_file` column
(case-insensitive), including GTDB-style paths such as:
`gtdb_genomes_reps_r220/database/GCA/.../GCA_949068525.1_genomic.fna.gz`.
If Genome_file points to a local file (absolute path or path relative to the abundance-table directory),
ZipStrain loads it directly into cache first, then only downloads what is still missing.
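A sketch of pulling an accession out of a GTDB-style `Genome_file` path: since the accession is embedded in the file name, a pattern match on the basename is enough. The regex is an assumption about accession shape (GCA_/GCF_ plus a versioned numeric ID), not ZipStrain's actual parser.

```python
import re
from pathlib import Path

# Assumed accession pattern: GenBank/RefSeq-style IDs such as
# GCA_949068525.1 or GCF_000001405.40.
ACCESSION_RE = re.compile(r"GC[AF]_\d+\.\d+")

def extract_accession(genome_file):
    """Pull an accession out of a Genome_file path, if one is present."""
    m = ACCESSION_RE.search(Path(genome_file).name)
    return m.group(0) if m else None

path = "gtdb_genomes_reps_r220/database/GCA/.../GCA_949068525.1_genomic.fna.gz"
print(extract_accession(path))  # GCA_949068525.1
```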
Outputs
The command writes:
- `/path/to/reference_bundle/reference_genomes.fna`
- `/path/to/reference_bundle/reference_genomes.stb`
- `/path/to/reference_bundle/genome_db_build_report.txt`
STB format
`reference_genomes.stb` has two columns (tab-separated, no header):
- scaffold ID in the concatenated FASTA
- genome ID
Genome IDs are accessions (for example GCF_000001405.40).
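Because the format is just two tab-separated columns, reading an STB into a scaffold-to-genome mapping is a few lines. The two-row excerpt below is hypothetical.

```python
import io

# Hypothetical STB excerpt: scaffold ID <tab> genome ID, no header.
stb_text = "contig_1\tGCF_000001405.40\ncontig_2\tGCA_949068525.1\n"

def read_stb(handle):
    """Map each scaffold ID to the genome ID it belongs to."""
    mapping = {}
    for line in handle:
        scaffold, genome = line.rstrip("\n").split("\t")
        mapping[scaffold] = genome
    return mapping

mapping = read_stb(io.StringIO(stb_text))
print(mapping["contig_1"])  # GCF_000001405.40
```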
Cache behavior
Inside --cache-dir, ZipStrain maintains:
- a local DB index (`.genome_db.parquet`)
- downloaded genomes under `genomes/`
Re-running with the same cache directory avoids redownloading genomes that already exist.
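The reuse rule amounts to checking the cache before downloading. A minimal sketch, assuming genomes are stored as `<cache-dir>/genomes/<accession>.fna.gz` (the exact file layout is an assumption):

```python
from pathlib import Path
import tempfile

def missing_accessions(cache_dir, wanted):
    """Return only the accessions not yet present in the cache.

    Assumes a <cache-dir>/genomes/<accession>.fna.gz layout.
    """
    genomes = Path(cache_dir) / "genomes"
    return [acc for acc in wanted if not (genomes / f"{acc}.fna.gz").exists()]

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    (cache / "genomes").mkdir()
    (cache / "genomes" / "GCA_000001.1.fna.gz").touch()  # pre-cached genome
    todo = missing_accessions(cache, ["GCA_000001.1", "GCA_000002.1"])
    print(todo)  # only the uncached accession still needs downloading
```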
Retry behavior
For genomes not already available locally or in the cache, ZipStrain retries each download with exponential backoff (default: up to 3 attempts per genome).
If a genome still fails after retries, it is skipped, and the reference bundle is built from successfully fetched genomes.
Parallelism for remote fetch is controlled with --download-workers.
If you see repeated Too Many Requests errors on large runs:
- lower `--download-workers` (for example 1 or 2)
- increase `--download-retries` (for example 8)
- increase `--retry-backoff-seconds` (for example 3 to 5)
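The retry policy described above can be sketched as a retry loop with exponential backoff: the base delay doubles after each failed attempt, and the genome is skipped once attempts are exhausted. This mirrors the documented defaults (3 attempts, 1.0 s base delay) but the function names and error handling here are illustrative, not ZipStrain internals.

```python
import time

def fetch_with_retries(fetch, retries=3, backoff_seconds=1.0):
    """Call fetch() up to `retries` times, sleeping backoff * 2**attempt
    between failures. Returns None when all attempts fail (genome skipped)."""
    for attempt in range(retries):
        try:
            return fetch()
        except OSError:
            if attempt == retries - 1:
                return None  # give up; caller skips this genome
            time.sleep(backoff_seconds * (2 ** attempt))

# Simulated flaky endpoint: fails twice, then succeeds.
attempts = []
def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("Too Many Requests")
    return b"fasta-bytes"

result = fetch_with_retries(flaky_fetch, backoff_seconds=0.01)
print(result)  # succeeds on the third attempt
```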
Console summary
build-genome-db prints a short run summary:
- selected genomes (non-zero abundance)
- genomes already cached before the run
- new download attempts
- downloaded now / failed
- genomes available in cache after the run
The same summary is saved to genome_db_build_report.txt and includes explicit failed accession IDs (with error messages) when downloads fail.
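For scripting around a run, the summary fields listed above reduce to a handful of counters. The field names and layout below are assumptions for illustration, not the exact format of `genome_db_build_report.txt`.

```python
# Hypothetical counters matching the summary fields listed above;
# the values here are made up.
summary = {
    "selected genomes (non-zero abundance)": 120,
    "genomes already cached before the run": 95,
    "new download attempts": 25,
    "downloaded now": 23,
    "failed": 2,
}

report = "\n".join(f"{name}: {count}" for name, count in summary.items())
print(report)
```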