Snakemake pipeline to discover coassembly sample clusters based on co-occurrence of single-copy marker genes, excluding those genes present in reference genomes (e.g. previously recovered genomes).
# Example: cluster reads into proposed coassemblies
binchicken coassemble --forward reads_1.1.fq ... --reverse reads_1.2.fq ...
# Example: cluster reads into proposed coassemblies based on unbinned sequences
binchicken coassemble --forward reads_1.1.fq ... --reverse reads_1.2.fq ... --genomes genome_1.fna ...
# Example: cluster reads into proposed coassemblies based on unbinned sequences and coassemble only unbinned reads
binchicken coassemble --forward reads_1.1.fq ... --reverse reads_1.2.fq ... --genomes genome_1.fna ... --assemble-unmapped
# Example: cluster reads into proposed coassemblies based on unbinned sequences from a specific taxa
binchicken coassemble --forward reads_1.1.fq ... --reverse reads_1.2.fq ... --genomes genome_1.fna ... --taxa-of-interest "p__Planctomycetota"
# Example: find relevant samples for differential coverage binning (no coassembly)
binchicken coassemble --forward reads_1.1.fq ... --reverse reads_1.2.fq ... --single-assembly
# Example: run proposed coassemblies through aviary with cluster submission
# Create snakemake profile at ~/.config/snakemake/qsub with cluster, cluster-status, cluster-cancel, etc.
# See https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles
binchicken coassemble --forward reads_1.1.fq ... --reverse reads_1.2.fq ... --run-aviary \
--snakemake-profile qsub --cluster-submission --local-cores 64 --cores 64
Important options:
--num-coassembly-samples
and --max-coassembly-samples
, both default to 2)--max-recovery-samples
, default 20)--genomes
)--assemble-unmapped
).--run-aviary
)coassemble/commands
in output)coassemble/target/elusive_clusters.tsv
in output)--taxa-of-interest "p__Planctomycetota"
).--single-assembly
)Paired end reads of form reads_1.1.fq, reads_1_1.fq and reads_1_R1.fq, where reads_1 is the sample name are automatically detected and matched to their basename.
Most intermediate files can be provided to skip intermediate steps (e.g. SingleM otu tables, read sizes or genome transcripts; see binchicken coassemble --full-help
).
By default, coassemblies are ranked by the number of feasibly-recovered target sequences they contain.
Instead, --abundance-weighted
can be used to weight target sequences by their average abundance across samples.
This prioritises recovery of the most abundant lineages.
The samples for which abundances are calculated can be restricted using --abundance-weighted-samples
.
Clustering groups of more than 1000 samples quickly leads to memory issues due to combinatorics.
Kmer preclustering can be used (default if >1000 samples are provided, or use --kmer-precluster always
) to reduce the number of combinations that are considered.
This greatly reduces memory usage and allows scaling up to at least 250k samples.
Kmer preclustering can be disabled with --kmer-precluster never
.
Snakemake profiles can be used to automatically submit jobs to HPC clusters (--snakemake-profile
).
Note that Aviary assemble commands are submitted to the cluster, while Aviary recover commands are run locally such that Aviary handles cluster submission.
The --cluster-submission
flag sets the local Aviary recover thread usage to 1, to enable multiple runs in parallel by setting --local-cores
to greater than 1.
This is required to prevent --local-cores
from limiting the number of threads per submitted job.
--forward, --reads, --sequences FORWARD [FORWARD ...]
input forward/unpaired nucleotide read sequence(s)
--forward-list, --reads-list, --sequences-list FORWARD_LIST
input forward/unpaired nucleotide read sequence(s) newline separated
--reverse REVERSE [REVERSE ...]
input reverse nucleotide read sequence(s)
--reverse-list REVERSE_LIST
input reverse nucleotide read sequence(s) newline separated
--genomes GENOMES [GENOMES ...]
Reference genomes for read mapping
--genomes-list GENOMES_LIST
Reference genomes for read mapping newline separated
--coassembly-samples COASSEMBLY_SAMPLES [COASSEMBLY_SAMPLES ...]
Restrict coassembly to these samples. Remaining samples will still be used for recovery [default: use all samples]
--coassembly-samples-list COASSEMBLY_SAMPLES_LIST
Restrict coassembly to these samples, newline separated. Remaining samples will still be used for recovery [default: use all samples]
--singlem-metapackage SINGLEM_METAPACKAGE
SingleM metapackage for sequence searching. [default: use path from SINGLEM_METAPACKAGE_PATH env variable]
--sample-singlem SAMPLE_SINGLEM [SAMPLE_SINGLEM ...]
SingleM otu tables for each sample, in the form "[sample name]_read.otu_table.tsv". If provided, SingleM pipe sample is skipped
--sample-singlem-list SAMPLE_SINGLEM_LIST
SingleM otu tables for each sample, in the form "[sample name]_read.otu_table.tsv" newline separated. If provided, SingleM pipe sample is skipped
--sample-singlem-dir SAMPLE_SINGLEM_DIR
Directory containing SingleM otu tables for each sample, in the form "[sample name]_read.otu_table.tsv". If provided, SingleM pipe sample is skipped
--sample-query SAMPLE_QUERY [SAMPLE_QUERY ...]
Queried SingleM otu tables for each sample against genome database, in the form "[sample name]_query.otu_table.tsv". If provided, SingleM pipe and appraise are skipped
--sample-query-list SAMPLE_QUERY_LIST
Queried SingleM otu tables for each sample against genome database, in the form "[sample name]_query.otu_table.tsv" newline separated. If provided, SingleM pipe and appraise are skipped
--sample-query-dir SAMPLE_QUERY_DIR
Directory containing Queried SingleM otu tables for each sample against genome database, in the form "[sample name]_query.otu_table.tsv". If provided, SingleM pipe and appraise are skipped
--sample-read-size SAMPLE_READ_SIZE
Comma separated list of sample name and size (bp). If provided, sample read counting is skipped
--genome-transcripts GENOME_TRANSCRIPTS [GENOME_TRANSCRIPTS ...]
Genome transcripts for reference database, in the form "[genome]_protein.fna"
--genome-transcripts-list GENOME_TRANSCRIPTS_LIST
Genome transcripts for reference database, in the form "[genome]_protein.fna" newline separated
--genome-singlem GENOME_SINGLEM
Combined SingleM otu tables for genome transcripts. If provided, genome SingleM is skipped
--taxa-of-interest TAXA_OF_INTEREST
Only consider sequences from this GTDB taxa (e.g. p__Planctomycetota) [default: all]
--appraise-sequence-identity APPRAISE_SEQUENCE_IDENTITY
Minimum sequence identity for SingleM appraise against reference database [default: 86%, Genus-level]
--min-sequence-coverage MIN_SEQUENCE_COVERAGE
Minimum combined coverage for sequence inclusion [default: 10]
--single-assembly
Skip appraise to discover samples to differential abundance binning. Forces --num-coassembly-samples and --max-coassembly-samples to 1 and sets --max- coassembly-size to None
--exclude-coassemblies EXCLUDE_COASSEMBLIES [EXCLUDE_COASSEMBLIES ...]
List of coassemblies to exclude, space separated, in the form "sample_1,sample_2"
--exclude-coassemblies-list EXCLUDE_COASSEMBLIES_LIST
List of coassemblies to exclude, space separated, in the form "sample_1,sample_2", newline separated
--num-coassembly-samples NUM_COASSEMBLY_SAMPLES
Number of samples per coassembly cluster [default: 2]
--max-coassembly-samples MAX_COASSEMBLY_SAMPLES
Upper bound for number of samples per coassembly cluster [default: --num- coassembly-samples]
--max-coassembly-size MAX_COASSEMBLY_SIZE
Maximum size (Gbp) of coassembly cluster [default: 50Gbp]
--max-recovery-samples MAX_RECOVERY_SAMPLES
Upper bound for number of related samples to use for differential abundance binning [default: 20]
--abundance-weighted
Weight sequences by mean sample abundance when ranking clusters [default: False]
--abundance-weighted-samples ABUNDANCE_WEIGHTED_SAMPLES [ABUNDANCE_WEIGHTED_SAMPLES ...]
Restrict sequence weighting to these samples. Remaining samples will still be used for coassembly [default: use all samples]
--abundance-weighted-samples-list ABUNDANCE_WEIGHTED_SAMPLES_LIST
Restrict sequence weighting to these samples, newline separated. Remaining samples will still be used for coassembly [default: use all samples]
--kmer-precluster {never,large,always}
Run kmer preclustering using unbinned window sequences as kmers. [default: large; perform preclustering when given >1000 samples]
--precluster-size PRECLUSTER_SIZE
# of samples within each sample's precluster [default: 5 * max-recovery- samples]
--prodigal-meta
Use prodigal "-p meta" argument (for testing)
--assemble-unmapped
Only assemble reads that do not map to reference genomes
--run-qc
Run Fastp QC on reads
--unmapping-min-appraised UNMAPPING_MIN_APPRAISED
Minimum fraction of sequences binned to justify unmapping [default: 0.1]
--unmapping-max-identity UNMAPPING_MAX_IDENTITY
Maximum sequence identity of mapped sequences kept for coassembly [default: 99%]
--unmapping-max-alignment UNMAPPING_MAX_ALIGNMENT
Maximum percent alignment of mapped sequences kept for coassembly [default: 99%]
--run-aviary
Run Aviary commands for all identified coassemblies (unless specific coassemblies are chosen with --coassemblies) [default: do not]
--cluster-submission
Flag that cluster submission will occur through `--snakemake-profile`. This sets the local threads of Aviary recover to 1, allowing parallel job submission [default: do not]
--aviary-speed {fast,comprehensive}
Run Aviary recover in 'fast' or 'comprehensive' mode. Fast mode skips slow binners and refinement steps. [default: fast]
--assembly-strategy {dynamic,metaspades,megahit}
Assembly strategy to use with Aviary. [default: dynamic; attempts metaspades and if fails, switches to megahit]
--aviary-gtdbtk-db AVIARY_GTDBTK_DB
Path to GTDB-Tk database directory for Aviary. Only required if --aviary-speed is set to comprehensive [default: use path from GTDBTK_DATA_PATH env variable]
--aviary-checkm2-db AVIARY_CHECKM2_DB
Path to CheckM2 database directory for Aviary. [default: use path from CHECKM2DB env variable]
--aviary-assemble-cores AVIARY_ASSEMBLE_CORES
Maximum number of cores for Aviary assemble to use. [default: 64]
--aviary-assemble-memory AVIARY_ASSEMBLE_MEMORY
Maximum amount of memory for Aviary assemble to use (Gigabytes). [default: 500]
--aviary-recover-cores AVIARY_RECOVER_CORES
Maximum number of cores for Aviary recover to use. [default: 32]
--aviary-recover-memory AVIARY_RECOVER_MEMORY
Maximum amount of memory for Aviary recover to use (Gigabytes). [default: 250]
--output OUTPUT
Output directory [default: .]
--conda-prefix CONDA_PREFIX
Path to conda environment install location. [default: Use path from CONDA_ENV_PATH env variable]
--cores CORES
Maximum number of cores to use [default: 1]
--dryrun
dry run workflow
--snakemake-profile SNAKEMAKE_PROFILE
Snakemake profile (see https://snakemake.readthedocs.io/en/v7.32.3/executing/cli.html#profiles). Can be used to submit rules as jobs to cluster engine (see https://snakemake.readthedocs.io/en/v7.32.3/executing/cluster.html).
--local-cores LOCAL_CORES
Maximum number of cores to use on localrules when running in cluster mode [default: 1]
--cluster-retries CLUSTER_RETRIES
Number of times to retry a failed job when using cluster submission (see `--snakemake-profile`) [default: 3].
--snakemake-args SNAKEMAKE_ARGS
Additional commands to be supplied to snakemake in the form of a space- prefixed single string e.g. " --quiet"
--tmp-dir TMP_DIR
Path to temporary directory. [default: no default]
--debug
output debug information
--version
output version information and quit
--quiet
only output errors
--full-help
print longer help message
--full-help-roff
print longer help message in ROFF (manpage) format
cluster reads into proposed coassemblies
$ binchicken coassemble --forward reads_1.1.fq ... --reverse reads_1.2.fq ...
cluster reads into proposed coassemblies based on unbinned sequences
$ binchicken coassemble --forward reads_1.1.fq ... --reverse reads_1.2.fq ... --genomes genome_1.fna ...
cluster reads into proposed coassemblies based on unbinned sequences and coassemble only unbinned reads
$ binchicken coassemble --forward reads_1.1.fq ... --reverse reads_1.2.fq ... --genomes genome_1.fna ... --assemble-unmapped
cluster reads into proposed coassemblies based on unbinned sequences from a specific taxa
$ binchicken coassemble --forward reads_1.1.fq ... --reverse reads_1.2.fq ... --genomes genome_1.fna ... --taxa-of-interest "p__Planctomycetota"
find relevant samples for differential coverage binning (no coassembly)
$ binchicken coassemble --forward reads_1.1.fq ... --reverse reads_1.2.fq ... --single-assembly
Powered by Doctave