NOTE: This is under active development; any feedback will be very useful.
one day there will be support for the fly emoji on GitHub!
A Quick Note
If you've worked with bacterial sequences, in all likelihood you have used one of Torsten Seemann's tools. One such tool is Shovill, which takes the bacterial genome assembly process and makes it quick and painless. Shovill was developed for paired-end Illumina reads, and there is a fork, shovill-se, which supports single-end reads.
Given the widespread usage of Shovill, and Torsten basically laying much of the groundwork, I decided to use Shovill as a framework for Dragonflye. Dragonflye can be considered a fork of Shovill that supports assembling Oxford Nanopore sequences. By going this route users will not have to relearn parameters, and will already be familiar with the outputs.
At this point, you might be wondering: so Robert you just hacked Shovill to work with ONT reads, why not just call it 'shovill-ont'?
That's because when I asked if there was interest in a "Shovill" for ONT reads, Curtis Kapsak (@kapsakcj) responded:
Curtis Kapsak (@kapsakcj): if wrapping flye, perhaps call it dragonflye (a very fast flye)?
And honestly, how could I not go with that?!? It's an amazing play on words that I'm willing to bet Torsten would be proud of!
So to sum it up, thank you Torsten for Shovill and providing a framework for Dragonflye.
Dragonflye is a pipeline that aims to make assembling Oxford Nanopore reads quick and easy. Still working on the quick part, but I think the easy part is there. Dragonflye currently supports Flye, Miniasm+Minipolish and Raven assemblers, and Racon and Medaka polishers.
- Estimate genome size and read length from reads (unless --gsize provided) (kmc)
- Reduce FASTQ files to a sensible depth (default --depth 150) (rasusa)
- Filter reads by length (default --minreadlength 1000) (filtlong)
- Assemble with Flye, Miniasm+Minipolish, or Raven
- Polish assembly with Racon and/or Medaka
- Remove contigs that are too short, too low coverage, or pure homopolymers
- Produce final FASTA with nicer names and parsable annotations
- Output parsable assembly statistics (assembly-scan)
    dragonflye --reads my-ont.fastq.gz --outdir dragonflye --gsize 5000000
    ... LOG TEXT ...
    [dragonflye] Final assembly contigs: /home/robert_petit/repos/dragonflye/temp/dragonflye/contigs.fa
    [dragonflye] It contains 3 (min=4864) contigs totalling 4939840 bp.
    [dragonflye] Dragonfly fossils have been found with wingspans up to two feet (61cm)!
    [dragonflye] Done.

    ls dragonflye/
    contigs.fa  contigs.gfa  dragonflye.log  flye-info.txt  flye.fasta

    head -n4 dragonflye/contigs.fa
    >contig00001 len=4818942 cov=62.0 corr=0 origname=contig_1 sw=dragonflye-flye/0.0.1 date=20210720 circular=Y
    TTAATTTGATGCCTGGCAGTTCCCTACTCTCGCATGGGGAGACCCCACACTACCATCGGC
    GCTACGGCGTTTCACTTCTGAGTTCGGCATGGGGTCAGGTGGGACCACCGCGCTAAGGCC
    GCCAGGCAAATTCTGTTTTATCAGACCGCTTCTGCGTTCTGATTTAATCTGTATCAGGCT
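The annotations in each contig header are space-separated key=value pairs, so they can be pulled apart with standard tools. Here is a minimal sketch, assuming the headers follow the layout shown in the example run. The function name `contig_table` is hypothetical and not part of Dragonflye.

```shell
# Print one tab-separated row (name, length, coverage, circular) per contig.
# Assumes headers look like: >contig00001 len=... cov=... circular=Y
# Fields that are absent from a header are left empty.
contig_table() {
    awk '/^>/ {
        name = substr($1, 2); len = ""; cov = ""; circ = ""
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "len") len = kv[2]
            else if (kv[1] == "cov") cov = kv[2]
            else if (kv[1] == "circular") circ = kv[2]
        }
        printf "%s\t%s\t%s\t%s\n", name, len, cov, circ
    }' "$@"
}
```

It could be run as `contig_table dragonflye/contigs.fa`, or fed a FASTA on standard input.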
Dragonflye is available from Bioconda. Dragonflye includes a lot of programs, so it can take conda a while to solve the environment. Because of this, I personally use Mamba to install it, since it's so much faster.
    # With conda
    conda create -n dragonflye -c conda-forge -c bioconda dragonflye

    # With Mamba (much quicker)
    mamba create -n dragonflye -c conda-forge -c bioconda dragonflye
    Dragonflye - A very fast flye
    SYNOPSIS
      De novo assembly pipeline for bacterial isolates with Nanopore reads
    USAGE
      dragonflye [options] --outdir DIR --reads READS.fastq.gz
    GENERAL
      --help           This help
      --version        Print version and exit
      --check          Check dependencies are installed
      --seed N         Random seed to use (default: 42)
    INPUT
      --reads XXX      Input Nanopore FASTQ (default: '')
      --depth N        Sub-sample --reads to this depth. Disable with --depth 0 (default: 150)
      --minreadlen N   Minimum read length. Disable with --minreadlength 0 (default: 1000)
      --gsize XXX      Estimated genome size eg. 3.2M <blank=AUTODETECT> (default: '')
    OUTPUT
      --outdir XXX     Output folder (default: '')
      --force          Force overwite of existing output folder (default: OFF)
      --minlen N       Minimum contig length <0=AUTO> (default: 500)
      --mincov n.nn    Minimum contig coverage <0=AUTO> (default: 2)
      --namefmt XXX    Format of contig FASTA IDs in 'printf' style (default: 'contig%05d')
      --keepfiles      Keep intermediate files (default: OFF)
    RESOURCES
      --tmpdir XXX     Fast temporary directory (default: '')
      --cpus N         Number of CPUs to use (0=ALL) (default: 8)
      --ram n.nn       Try to keep RAM usage below this many GB (default: 16)
    ASSEMBLER
      --assembler XXX  Assembler: raven miniasm flye (default: 'flye')
      --opts XXX       Extra assembler options in quotes eg. flye: '--interations' (default: '')
    POLISHER
      --racon N        Number of polishing rounds to conduct with Racon (default: 1)
      --medaka N       Number of polishing rounds to conduct with Medaka (requires --model) (default: 0)
      --model XXX      The model to be used by Medaka, (Assumes 1 polishing round, if --medaka not used) (default: '')
      --list_models    List the models available to Medaka (default: OFF)
    MODULES
      --nofilter       Disable read length filtering (default: OFF)
      --nopolish       Disable assembly polishing (default: OFF)
    HOMEPAGE
      https://github.com/rpetit3/dragonflye - Robert A Petit III
Giving an assembler too much data is a bad thing. There comes a point where you are no longer adding new information (as the genome is a fixed size), and only adding more noise (sequencing errors). Because of this Dragonflye will downsample your FASTQ files to a specific depth (defaults to 150x). It estimates depth by dividing read yield by genome size.
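To make the arithmetic concrete, here is a rough sketch of the depth estimate and the fraction of reads that would be kept to reach a target depth. It only illustrates the calculation described above and is not Dragonflye's own code. The function names `estimate_depth` and `keep_fraction` are hypothetical.

```shell
# Estimated depth = total bases sequenced / genome size (both in bp).
estimate_depth() {
    awk -v yield="$1" -v gsize="$2" 'BEGIN { printf "%.1f\n", yield / gsize }'
}

# Fraction of reads to keep to reach a target depth. Capped at 1,
# since data already at or below the target is left alone.
keep_fraction() {
    awk -v depth="$1" -v target="$2" 'BEGIN {
        f = (depth + 0 > target + 0) ? target / depth : 1
        printf "%.3f\n", f
    }'
}
```

For example, 1.5 Gbp of reads for a 5 Mbp genome gives `estimate_depth 1500000000 5000000` = 300.0, so at the default `--depth 150` roughly half of the data would be kept (`keep_fraction 300 150` = 0.500).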
The genome size is needed to estimate depth and for the assembly stage. If you don't provide --gsize, it will be estimated via k-mer frequencies using kmc. It doesn't need to be a perfect estimate, just in the right ballpark. If you know the genome size, providing it is usually better than the estimate and will save some time.
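The help text gives `3.2M` as an example genome size. As an illustration of what such a value means in base pairs, here is a small sketch. The K/M/G suffix handling is an assumption made for this example and may not match how Dragonflye parses the value internally. The function name `gsize_to_bp` is hypothetical.

```shell
# Convert a genome size such as 3.2M or 5000000 to base pairs.
# Suffix handling (K, M, G as powers of 1000) is an assumption for illustration.
gsize_to_bp() {
    awk -v s="$1" 'BEGIN {
        unit = toupper(substr(s, length(s), 1))
        mult = 1
        if (unit == "K") mult = 1000
        else if (unit == "M") mult = 1000000
        else if (unit == "G") mult = 1000000000
        if (mult > 1) s = substr(s, 1, length(s) - 1)
        printf "%.0f\n", s * mult
    }'
}
```

Under these assumptions, `gsize_to_bp 3.2M` gives 3200000, the same scale as the `--gsize 5000000` used in the example run.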
--keepfiles will keep all the intermediate files in --outdir so you can explore and debug.
By default, --cpus will attempt to use all available CPU cores.
Dragonflye will do its best to keep memory usage below the --ram value, but it is not guaranteed. If you are on an HPC cluster, you should make sure you tell your job submission engine a value higher than this.
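As a sketch of that advice, the helper below prints a batch script that requests more memory from the scheduler than is passed to `--ram`. It assumes a SLURM cluster and an arbitrary 25% headroom. Neither comes from Dragonflye itself, and the function name `make_job_script` is hypothetical.

```shell
# Print a SLURM batch script that requests more memory than --ram.
# Assumptions: SLURM scheduler, whole-GB values, roughly 25% headroom.
make_job_script() (
    ram_gb=$1
    cpus=$2
    mem_gb=$(( ram_gb + (ram_gb + 3) / 4 ))   # --ram plus about 25%, rounded up
    cat <<EOF
#!/bin/sh
#SBATCH --cpus-per-task=${cpus}
#SBATCH --mem=${mem_gb}G
dragonflye --reads my-ont.fastq.gz --outdir dragonflye --cpus ${cpus} --ram ${ram_gb}
EOF
)
```

Running `make_job_script 16 8 > job.sh` would write a script that asks SLURM for 20G while telling Dragonflye to aim for 16 GB. Adjust the headroom to whatever your site recommends.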
By default, --assembler will use Flye.
If you want to provide some assembler-specific parameters, you can use the --opts parameter. Make sure you quote the parameters so they get passed as a single string, e.g. for --assembler flye you might use --opts "--iterations 4 --plasmids".
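To see why the quotes matter, this small demonstration counts how many arguments the shell actually hands over. The helper `count_args` is hypothetical and exists only for this illustration.

```shell
# Hypothetical helper that reports how many arguments it received.
count_args() {
    echo "$#"
}

count_args --opts "--iterations 4 --plasmids"   # prints 2: the flag plus one quoted string
count_args --opts --iterations 4 --plasmids     # prints 4: the shell split the options apart
```

Without the quotes, the assembler options arrive as separate arguments rather than as the single value of --opts.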
--racon & --medaka
These two parameters adjust how many polishing rounds are conducted per polisher. For example, --racon 2 would conduct 2 rounds of polishing with Racon. If --medaka is provided, a valid basecaller model must also be provided with --model. If a valid model is provided but --medaka is not, Dragonflye will assume 1 round of Medaka polishing.
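The interaction between --medaka and --model, as the help text describes it ("requires --model" and "Assumes 1 polishing round, if --medaka not used"), can be sketched as a small decision function. This illustrates the documented behaviour and is not Dragonflye's actual implementation. The function name `medaka_rounds` is hypothetical.

```shell
# Resolve the effective number of Medaka rounds from the --medaka value ($1)
# and the --model value ($2). Prints the round count, or fails when rounds
# were requested without a model.
medaka_rounds() (
    rounds=${1:-0}
    model=${2:-}
    if [ -z "$model" ]; then
        if [ "$rounds" -gt 0 ]; then
            echo "error: --medaka requires --model" >&2
            exit 1
        fi
        echo 0
    elif [ "$rounds" -eq 0 ]; then
        echo 1            # model given without --medaka: assume one round
    else
        echo "$rounds"
    fi
)
```

With a placeholder model name, `medaka_rounds 0 some_model` prints 1 and `medaka_rounds 2 some_model` prints 2, while `medaka_rounds 1 ""` exits with an error.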
--list_models will list all basecaller models that are available in Medaka.
Choosing which stages to use
|Genome size estimation||default||
|Read length filtering||default||
Environment variables recognised
These environment variables will be used as defaults instead of the built-in defaults. You can still override them with the normal command line options.
||The final assembly you should use|
||Full log file for bug reporting|
||Raw assembly (flye)|
||Information about contigs output by Flye|
||Raw assembly (miniasm+minipolish)|
||Raw assembly (raven)|
Perl?!?! Perl?!? Really, why Perl?
Dragonflye is a fork of Shovill, and Shovill was written in Perl. Haha, so yeah, instead of writing from scratch, I dusted off the old Perl skills. The Perl interpreter then basically told me I sucked at Perl every time I tried to make a change (haha, I kept forgetting the semicolons at the end of the line!).
Will dragonflye accept Illumina reads?
No, this is strictly for Nanopore reads only. If you want to assemble Illumina reads, use Shovill.
Doesn't Trycycler already do this?
Dragonflye is not trying to replicate Trycycler; Trycycler is on a whole 'nother level. If you are looking to get super high quality assemblies with some manual inspection steps in between, use Trycycler. But if you are looking to just get a quick assembly that you can work with, that's what Dragonflye is for.
Please file questions, bugs or ideas to the Issue Tracker.
I would like to personally extend my many thanks and gratitude to the authors of these software packages. Really, thank you very much!
Convert various sequence formats to FASTA
Seemann, T. any2fasta: Convert various sequence formats to FASTA.
Generate basic stats for an assembly.
Petit III, R. A. assembly-scan: generate basic stats for an assembly.
De novo assembler for single molecule sequencing reads using repeat graphs
Kolmogorov, M., Yuan, J., Lin, Y., Pevzner, P. Assembly of Long Error-Prone Reads Using Repeat Graphs. Nature Biotechnology (2019).
Sequence correction provided by ONT Research
Li, H. Medaka: Sequence correction provided by ONT Research
Ultrafast de novo assembly for long noisy reads (though having no consensus step)
Li, H. Miniasm: Ultrafast de novo assembly for long noisy reads
A versatile pairwise aligner for genomic and spliced nucleotide sequences
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. (2018)
A parallel implementation of gzip for modern multi-processor, multi-core machines.
Adler, M. pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines. Jet Propulsion Laboratory (2015).
Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads
R. Vaser, I. Sović, N. Nagarajan, M. Šikić, Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
Randomly subsample sequencing reads to a specified coverage
Hall, M.B. Rasusa: Randomly subsample sequencing reads to a specified coverage. (2019).
De novo genome assembler for long uncorrected reads
Vaser, R., Šikić, M. Time- and memory-efficient genome assembly with Raven. Nat Comput Sci 1, 332–336 (2021).
A fast and lightweight tool for processing sequences in the FASTA or FASTQ format.
Li, H. Seqtk: Toolkit for processing sequences in FASTA/Q formats