Downstream Analysis Catalogue
This page summarizes the downstream analysis helpers currently exposed in zipstrain.visualize.
Input conventions
Most comparison-driven functions expect a pl.LazyFrame with some or all of these columns:
- genome
- sample_1
- sample_2
- genome_pop_ani
- total_positions
- max_consecutive_length
- shared_genes_count
- identical_gene_count
- perc_id_genes
Population-aware functions also expect a sample_to_population lazy frame with:
- sample
- population
Common setup
For most downstream analysis, start from a genome-comparison parquet and a sample metadata table:
import polars as pl
from zipstrain import visualize
comps_lf = pl.scan_parquet("example_genome_compare.parquet")
sample_to_population = pl.scan_csv("sample_to_population.csv")
Where:
- example_genome_compare.parquet is a genome-compare output parquet
- sample_to_population.csv contains at least:
  - sample
  - population
Example metadata file:
sample,population
sample_a,Population_1
sample_b,Population_1
sample_c,Population_2
sample_d,Population_2
Use pl.scan_parquet(...) and pl.scan_csv(...) when a function expects a LazyFrame.
Use pl.read_parquet(...) when a function expects an eager DataFrame.
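Before scanning the metadata file, it can help to confirm that it has the two required columns and that no sample is listed twice. The following is a minimal sketch using only the standard library; the helper name check_sample_metadata is hypothetical and not part of zipstrain.

```python
import csv
import io


def check_sample_metadata(csv_text):
    """Parse metadata CSV text and return a {sample: population} dict.

    Hypothetical helper (not part of zipstrain). Raises ValueError if a
    required column is missing or a sample appears more than once.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"sample", "population"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

    mapping = {}
    for row in reader:
        sample = row["sample"]
        if sample in mapping:
            raise ValueError(f"duplicate sample: {sample}")
        mapping[sample] = row["population"]
    return mapping


# Same content as the example metadata file above.
example = """sample,population
sample_a,Population_1
sample_b,Population_1
sample_c,Population_2
sample_d,Population_2
"""

metadata = check_sample_metadata(example)
print(metadata)
```

For a real file, read its text first (for example with pathlib.Path(...).read_text()) and pass it to the helper.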
Working with outputs
The downstream helpers return a mix of tables and figures.
If a function returns a pl.DataFrame
You can inspect it directly:
print(df.head())
You can save it for later:
df.write_parquet("output.parquet")
df.write_csv("output.csv")
If a function returns a Plotly figure
You can open it interactively:
fig.show()
You can save it as an interactive HTML file:
fig.write_html("figure.html")
Or save a static image, which requires a Plotly image-export backend such as kaleido:
fig.write_image("figure.png")
If a function returns a Matplotlib figure
You can save it to an image file:
fig.savefig("figure.png", dpi=200, bbox_inches="tight")
If a function returns a Seaborn ClusterGrid
Save the underlying Matplotlib figure:
grid.fig.savefig("clustermap.png", dpi=200, bbox_inches="tight")
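If you call several helpers in one script, a small dispatcher can pick the right writer for each return type. This is a sketch only: save_output is a hypothetical helper, not part of zipstrain, and it uses duck typing (checking which writer method the object offers) rather than importing polars, Plotly, or Matplotlib.

```python
def save_output(obj, stem):
    """Save a downstream result with whichever writer the object offers.

    Hypothetical convenience helper (not part of zipstrain).
    Returns the path that was written.
    """
    if hasattr(obj, "write_parquet"):    # e.g. pl.DataFrame
        path = f"{stem}.parquet"
        obj.write_parquet(path)
    elif hasattr(obj, "write_html"):     # e.g. Plotly figure
        path = f"{stem}.html"
        obj.write_html(path)
    elif hasattr(obj, "savefig"):        # e.g. Matplotlib figure or seaborn grid
        path = f"{stem}.png"
        obj.savefig(path, dpi=200, bbox_inches="tight")
    else:
        raise TypeError(f"don't know how to save {type(obj).__name__}")
    return path
```

Usage would look like save_output(fig, "strainsharing") or save_output(df, "ibs_summary"). The order of the checks matters only if an object offers more than one of these methods.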
Function catalogue
calculate_strainsharing
- Purpose: quantify asymmetric strain-sharing rates between populations.
- Inputs:
  - comps_lf
  - breadth_lf
  - sample_to_population
  - min_breadth
  - strain_similarity_threshold
  - min_total_positions
- Output: dict[str, list[float]]
- Notes:
  - Uses breadth and ANI filters before computing pairwise sharing rates.
What it does:
- This function asks: for each pair of populations, how often do samples appear to share the same strain?
- It returns a Python dictionary where each key is a population pair and each value is a list of sharing rates across sample pairs.
- You will usually pass the result into plot_strainsharing(...).
Example:
breadth_lf = pl.scan_parquet("example_breadth_table.parquet")
strainsharing = visualize.calculate_strainsharing(
    comps_lf=comps_lf,
    breadth_lf=breadth_lf,
    sample_to_population=sample_to_population,
    min_breadth=0.5,
    strain_similarity_threshold=99.9,
    min_total_positions=10000,
)
print(strainsharing.keys())
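Because the result is a plain dictionary of lists, you can summarize it with the standard library before plotting. The sketch below uses made-up values in the documented dict[str, list[float]] shape; the key format shown is an assumption, so check strainsharing.keys() for the real one. The helper summarize_rates is hypothetical.

```python
from statistics import median

# Illustrative values only; the key format is an assumption.
strainsharing_example = {
    "Population_1-Population_1": [0.40, 0.55, 0.60],
    "Population_1-Population_2": [0.05, 0.10],
    "Population_2-Population_2": [0.35, 0.45, 0.50, 0.70],
}


def summarize_rates(rates_by_pair):
    """Return {key: (number of rates, median rate)}, skipping empty lists."""
    return {
        key: (len(rates), median(rates))
        for key, rates in rates_by_pair.items()
        if rates
    }


summary = summarize_rates(strainsharing_example)
for key, (n, med) in summary.items():
    print(f"{key}: n={n}, median={med:.3f}")
```

A quick summary like this makes it easier to spot population pairs with very few rates before you read too much into a boxplot.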
plot_strainsharing
- Purpose: plot strain-sharing rate distributions.
- Inputs:
  - strainsharingrates
  - sample_frac
  - title
  - xaxis_title
  - yaxis_title
- Output: plotly.graph_objects.Figure
What it does:
- This takes the dictionary from calculate_strainsharing(...) and turns it into a boxplot.
- It is useful for comparing how much strain sharing you see within and between groups.
Example:
fig = visualize.plot_strainsharing(
    strainsharingrates=strainsharing,
    sample_frac=1.0,
)
fig.show()
fig.write_html("strainsharing.html")
calculate_ibs
- Purpose: aggregate IBS-like distributions across within- and between-population comparisons.
- Inputs:
  - sample_to_population
  - comps_lf
  - max_perc_id_genes
  - min_total_positions
- Output: pl.DataFrame
- Notes:
  - Produces one row per genome and comparison type.
What it does:
- This function reorganizes IBS-like information into a format that is easier to plot.
- It groups comparisons into:
  - within-population
  - between-population
- The result is a DataFrame that you can pass to plot_ibs(...) or plot_ibs_heatmap(...).
Example:
ibs_df = visualize.calculate_ibs(
    sample_to_population=sample_to_population,
    comps_lf=comps_lf,
    max_perc_id_genes=15,
    min_total_positions=10000,
)
print(ibs_df.head())
ibs_df.write_parquet("ibs_summary.parquet")
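The within/between grouping described above can be sketched in plain Python. This is illustrative only: the function name, the label strings, and the choice to return None for samples missing from the metadata are assumptions, not zipstrain behavior.

```python
def comparison_type(sample_1, sample_2, sample_to_population):
    """Label a sample pair as 'within' or 'between' populations.

    Illustrative only. Returns None when either sample has no
    population in the metadata (an assumed convention).
    """
    pop_1 = sample_to_population.get(sample_1)
    pop_2 = sample_to_population.get(sample_2)
    if pop_1 is None or pop_2 is None:
        return None
    return "within" if pop_1 == pop_2 else "between"


populations = {
    "sample_a": "Population_1",
    "sample_b": "Population_1",
    "sample_c": "Population_2",
    "sample_d": "Population_2",
}
pairs = [
    ("sample_a", "sample_b"),
    ("sample_a", "sample_c"),
    ("sample_c", "sample_d"),
]
labels = [comparison_type(s1, s2, populations) for s1, s2 in pairs]
print(labels)
```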
plot_ibs
- Purpose: plot IBS cumulative distributions for two populations within one genome.
- Inputs:
  - df
  - genome
  - population_1
  - population_2
  - vert_thresh_hor_distance
  - num_bins
  - title
  - xaxis_title
  - yaxis_title
- Output: plotly.graph_objects.Figure
What it does:
- This draws the IBS cumulative distributions for two populations within one genome.
- It is most useful when you want to compare whether within-population pairs and between-population pairs have clearly different IBS behavior.
Example:
fig = visualize.plot_ibs(
    df=ibs_df,
    genome="GCF_000269965.1_ASM26996v1_genomic.fna",
    population_1="Population_1",
    population_2="Population_2",
)
fig.show()
fig.write_html("ibs_curve.html")
plot_ibs_heatmap
- Purpose: summarize IBS separation across genomes and population pairs.
- Inputs:
  - df
  - vert_thresh
  - populations
  - num_bins
  - min_member
  - title
  - xaxis_title
  - yaxis_title
- Output: plotly.graph_objects.Figure
What it does:
- This compresses the IBS comparison across many genomes into one heatmap.
- Each cell represents how separated the IBS distributions are for one genome and one population pair.
Example:
fig = visualize.plot_ibs_heatmap(
    df=ibs_df,
    min_member=10,
)
fig.show()
fig.write_html("ibs_heatmap.html")
calculate_identical_frac_vs_popani
- Purpose: collect genome-wide popANI and identical-gene fractions for two populations.
- Inputs:
  - genome
  - population_1
  - population_2
  - sample_to_population
  - comps_lf
  - min_shared_genes_count
  - min_total_positions
- Output: pl.DataFrame
What it does:
- This collects genome-wide popANI values and identical-gene fractions for two chosen populations.
- It is helpful when you want to see whether near-identical genomes also tend to have high identical-gene fractions.
Example:
identical_df = visualize.calculate_identical_frac_vs_popani(
    genome="GCF_000269965.1_ASM26996v1_genomic.fna",
    population_1="Population_1",
    population_2="Population_2",
    sample_to_population=sample_to_population,
    comps_lf=comps_lf,
)
print(identical_df.head())
identical_df.write_csv("identical_vs_popani.csv")
plot_identical_frac_vs_popani
- Purpose: scatterplot identical-gene fraction against genome-wide popANI.
- Inputs:
  - df
  - genome
  - title
  - xaxis_title
  - yaxis_title
- Output: plotly.graph_objects.Figure
What it does:
- This turns the output of calculate_identical_frac_vs_popani(...) into a scatterplot.
- It is useful for seeing whether samples form obvious within-group and between-group patterns.
Example:
fig = visualize.plot_identical_frac_vs_popani(
    df=identical_df,
    genome="GCF_000269965.1_ASM26996v1_genomic.fna",
)
fig.show()
fig.write_html("identical_vs_popani.html")
get_silhouette_plot
- Purpose: sweep ANI thresholds and plot clustering silhouette score for one genome.
- Inputs:
  - comps_lf
  - genome
  - min_comp_len
  - impute_method
  - max_null_samples
- Output: plotly.graph_objects.Figure
- Notes:
  - Expects one genome scope at a time.
  - Null ANI entries are imputed with a numeric ANI value.
What it does:
- This tries many ANI cutoffs and asks: “At which threshold do the resulting clusters look most coherent?”
- The result is a line plot of silhouette score versus ANI threshold.
- A higher silhouette score usually means cleaner separation between clusters.
Example:
fig = visualize.get_silhouette_plot(
    comps_lf=comps_lf,
    genome="GCF_000269965.1_ASM26996v1_genomic.fna",
    min_comp_len=100000,
)
fig.show()
fig.write_html("silhouette.html")
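To make the silhouette score concrete, here is a small pure-Python sketch that computes the mean silhouette coefficient from a distance matrix and a set of cluster labels. It is not zipstrain's implementation: converting ANI to distance as 100 - ANI and giving singleton clusters a score of 0 are assumptions made for this example, and the ANI values are invented.

```python
def silhouette_score(distances, labels):
    """Mean silhouette coefficient from a square distance matrix.

    distances[i][j] is the distance between samples i and j;
    labels[i] is the cluster id of sample i.
    Samples in singleton clusters are scored 0 (a common convention).
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, own_label in enumerate(labels):
        own = clusters[own_label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(distances[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(distances[i][j] for j in members) / len(members)
            for label, members in clusters.items()
            if label != own_label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy ANI matrix (percent) for four samples; distance = 100 - ANI (assumed).
ani = [
    [100.0, 99.99, 99.0, 99.1],
    [99.99, 100.0, 99.2, 99.0],
    [99.0, 99.2, 100.0, 99.98],
    [99.1, 99.0, 99.98, 100.0],
]
distances = [[100.0 - value for value in row] for row in ani]

good = silhouette_score(distances, [0, 0, 1, 1])  # keeps the tight pairs together
bad = silhouette_score(distances, [0, 1, 0, 1])   # splits the tight pairs
print(good, bad)
```

The labelling that keeps near-identical samples together scores close to 1, while the labelling that splits them scores below 0. That contrast is what the threshold sweep relies on.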
get_cluster_assignments
- Purpose: assign clonal and strain-level clusters from hierarchical clustering.
- Inputs:
  - comps_lf
  - min_comp_len
  - impute_method
  - max_null_samples
  - clonal_cluster_threshold
  - strain_cluster_threshold
- Output: pl.DataFrame with columns:
  - sample
  - clonal_cluster
  - strain_cluster
- Notes:
  - This helper expects comps_lf to already represent a single genome scope.
What it does:
- This performs hierarchical clustering on one genome’s ANI comparison matrix.
- It then assigns:
  - clonal_cluster
  - strain_cluster
- This is useful when you want cluster labels that you can join back to metadata and use downstream.
Example:
single_genome_lf = comps_lf.filter(
    pl.col("genome") == "GCF_000269965.1_ASM26996v1_genomic.fna"
)
cluster_df = visualize.get_cluster_assignments(
    comps_lf=single_genome_lf,
    clonal_cluster_threshold=99.93,
    strain_cluster_threshold=99.8,
)
print(cluster_df)
cluster_df.write_csv("cluster_assignments.csv")
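The idea of cutting a hierarchy at two ANI thresholds can be illustrated without any clustering library. This page does not state which linkage method zipstrain uses, so the sketch below assumes single linkage, which is equivalent to grouping samples that are connected by any chain of pairs at or above the threshold. The function name and the ANI values are invented for illustration.

```python
def threshold_clusters(samples, pair_ani, threshold):
    """Group samples whose ANI is >= threshold, transitively.

    Equivalent to cutting a single-linkage tree at `threshold`
    (an assumed linkage; zipstrain may use a different one).
    pair_ani maps (sample_1, sample_2) -> ANI in percent.
    Returns {sample: cluster_id}, ids numbered from 1 in sample order.
    """
    parent = {s: s for s in samples}

    def find(s):
        # Follow parent links to the root, shortening the path as we go.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (s1, s2), ani in pair_ani.items():
        if ani >= threshold:
            parent[find(s1)] = find(s2)

    cluster_ids = {}
    assignments = {}
    for s in samples:
        root = find(s)
        cluster_ids.setdefault(root, len(cluster_ids) + 1)
        assignments[s] = cluster_ids[root]
    return assignments


samples = ["sample_a", "sample_b", "sample_c", "sample_d"]
pair_ani = {
    ("sample_a", "sample_b"): 99.99,
    ("sample_a", "sample_c"): 99.85,
    ("sample_b", "sample_c"): 99.90,
    ("sample_a", "sample_d"): 98.5,
    ("sample_b", "sample_d"): 98.7,
    ("sample_c", "sample_d"): 98.6,
}

clonal = threshold_clusters(samples, pair_ani, threshold=99.93)
strain = threshold_clusters(samples, pair_ani, threshold=99.8)
print(clonal)
print(strain)
```

With these toy values the stricter threshold separates sample_c from the sample_a/sample_b pair, while the looser threshold merges all three. This shows why a clonal cluster always sits inside a strain cluster when the clonal threshold is the higher of the two.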
plot_dendo
- Purpose: draw a population-colored dendrogram for one genome.
- Inputs:
  - comps_lf
  - genome
  - sample_to_population
  - min_comp_len
  - impute_method
  - max_null_samples
  - color_map
  - inches_per_sample
  - font_size
  - color_threshold
  - clonal_cluster_threshold
  - strain_cluster_threshold
  - title
  - include_fraction_null
- Output: matplotlib.figure.Figure
What it does:
- This plots a dendrogram for one genome.
- Samples are clustered by ANI similarity.
- Leaf labels can be colored by population.
- If include_fraction_null=True, it also adds a side bar showing how much missing pairwise ANI information each sample had before imputation.
Example:
fig = visualize.plot_dendo(
    comps_lf=comps_lf,
    genome="GCF_000269965.1_ASM26996v1_genomic.fna",
    sample_to_population=sample_to_population,
    include_fraction_null=True,
)
fig.savefig("dendrogram.png", dpi=200, bbox_inches="tight")
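The quantity shown in the side bar can be approximated as the share of a sample's pairwise comparisons that have no ANI value. The sketch below is one plausible definition under that assumption; the function name, the input layout, and the exact definition used by zipstrain are not confirmed by this page.

```python
def fraction_null_per_sample(samples, pair_ani):
    """Fraction of each sample's pairwise comparisons with no ANI value.

    pair_ani maps (sample_1, sample_2) -> ANI or None, stored in one
    order only. Pairs that are absent count as missing too (assumed).
    """
    fractions = {}
    for sample in samples:
        others = [s for s in samples if s != sample]
        missing = 0
        for other in others:
            # Look the pair up in either order.
            value = pair_ani.get((sample, other), pair_ani.get((other, sample)))
            if value is None:
                missing += 1
        fractions[sample] = missing / len(others) if others else 0.0
    return fractions


samples = ["sample_a", "sample_b", "sample_c"]
pair_ani = {
    ("sample_a", "sample_b"): 99.9,
    ("sample_a", "sample_c"): None,  # compared, but no usable ANI
    # ("sample_b", "sample_c") was never compared
}

null_fractions = fraction_null_per_sample(samples, pair_ani)
print(null_fractions)
```

Samples with a high fraction rely heavily on imputed values, so their position in the dendrogram deserves more caution.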
get_clustermap
- Purpose: build a seaborn clustermap for one genome using ANI as similarity.
- Inputs:
  - comps_lf
  - genome
  - sample_to_population
  - min_comp_len
  - impute_method
  - max_null_samples
  - color_map
- Output: seaborn.matrix.ClusterGrid
What it does:
- This creates a clustered heatmap of ANI values for one genome.
- It is a good “overview plot” when you want to see sample blocks, clusters, and group structure all at once.
Example:
grid = visualize.get_clustermap(
    comps_lf=comps_lf,
    genome="GCF_000269965.1_ASM26996v1_genomic.fna",
    sample_to_population=sample_to_population,
)
grid.fig.suptitle("ANI clustermap", y=1.02)
grid.fig.savefig("clustermap.png", dpi=200, bbox_inches="tight")