Research
Compressive Pangenomics using PanMANs
Nature Genetics (Accepted) | GitHub | Wiki | BioRxiv
Scientific motivation:
Genomics is moving from studying a single reference genome to analyzing entire pangenomes that represent thousands of individuals. While scientifically powerful, these datasets are extremely large and challenging to store and process efficiently.
What PanMAN enables:
PanMAN introduces a compact representation of large genomic populations using mutation-annotated network structures that preserve meaningful evolutionary and biological information while dramatically shrinking data size.
Overall contribution of this work:
- Developed a novel data structure and file format to store shared mutational and evolutionary information at scale (millions of genome sequences).
- Enabled storage and analysis of ultra-large genomic datasets by reducing memory and storage requirements by more than 600X compared to common formats.
- Preserved biological interpretability while making large-scale pangenomic studies computationally practical.
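To illustrate why storing shared mutations can shrink data so much, here is a minimal Python sketch of a mutation-annotated tree. It is not PanMAN's actual data structure or file format: the `Node` class, the `(position, base)` mutation encoding, and the helper names are all hypothetical, and the sketch covers only substitutions on a single tree, not the network structure. Each node stores only its differences from its parent, and full sequences are rebuilt on demand.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node that stores only the substitutions relative to its parent."""
    name: str
    # Each mutation is (0-based position, new base); a hypothetical encoding.
    mutations: list = field(default_factory=list)
    children: list = field(default_factory=list)


def reconstruct(root_sequence, root):
    """Replay mutations from the root downward; return {node name: sequence}."""
    sequences = {}
    stack = [(root, list(root_sequence))]
    while stack:
        node, parent_bases = stack.pop()
        bases = parent_bases.copy()  # never modify the parent's copy
        for position, base in node.mutations:
            bases[position] = base
        sequences[node.name] = "".join(bases)
        for child in node.children:
            stack.append((child, bases))
    return sequences


def stored_mutation_count(node):
    """Count the mutations stored in the subtree rooted at this node."""
    return len(node.mutations) + sum(stored_mutation_count(c) for c in node.children)


# Toy example: five sequences of length 8 described by one root sequence
# plus four single-base mutations.
leaf_a = Node("A", [(1, "T")])
leaf_b = Node("B", [(7, "A")])
inner = Node("inner", [(3, "C")], [leaf_a, leaf_b])
leaf_c = Node("C", [(0, "G")])
root = Node("root", [], [inner, leaf_c])

sequences = reconstruct("ACGTACGT", root)
print(sequences["A"], sequences["B"], sequences["C"])
print("mutations stored:", stored_mutation_count(root))
```

In this toy case the mutation at position 3 is stored once on the inner node and inherited by both of its leaves, which is the kind of sharing that becomes significant when millions of closely related genomes sit on the same structure.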
Ultrafast & Ultralarge Phylogenetic Tree Construction using DIPPER
Nature Computational Science (Under Review) | GitHub | Wiki | BioRxiv
Scientific motivation:
Phylogenetic trees help scientists understand how species, pathogens, and viral strains evolve. However, traditional tools do not scale well to the millions of genomes now commonly generated in large-scale studies.
What DIPPER enables:
DIPPER is designed to construct very large phylogenetic trees extremely quickly, supporting real-time biological discovery and global-scale genomic surveillance.
Overall contribution of this work:
- Delivered a high-performance GPU-accelerated tool capable of scaling to ultra-large datasets (up to 10 million sequences).
- Introduced memory-efficient strategies (such as divide-and-conquer and on-the-fly distance computation) that allow trees to be built without exceeding limited GPU memory.
- Achieved up to 40X speedup while improving memory efficiency by up to 6X over state-of-the-art tools, expanding what is computationally feasible in evolutionary genomics.
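The idea behind on-the-fly distance computation can be sketched in a few lines of Python. This is not DIPPER's algorithm or its GPU implementation: the function names, the Hamming distance, and the nearest-neighbor task are assumptions chosen only to show the memory pattern. Each pairwise distance is computed when needed and then discarded, so the program keeps one result per sequence rather than a full n-by-n distance matrix.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbors(seqs):
    """For each sequence, return (index of closest other sequence, distance).

    Distances are computed on demand and never stored as a matrix, so the
    extra memory grows linearly with the number of sequences.
    """
    n = len(seqs)
    best = [(None, float("inf"))] * n
    for i in range(n):
        for j in range(i + 1, n):
            d = hamming(seqs[i], seqs[j])  # computed, used, then dropped
            if d < best[i][1]:
                best[i] = (j, d)
            if d < best[j][1]:
                best[j] = (i, d)
    return best


seqs = ["ACGT", "ACGA", "TCGA", "TTTT"]
result = nearest_neighbors(seqs)
print(result)
```

The running time is still quadratic in this sketch. The point is only the memory trade-off: for 10 million sequences a dense distance matrix would hold on the order of 10^14 entries, which is why recomputing distances can be preferable to storing them.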
Ultrafast & Ultralarge Multiple Sequence Alignment using TWILIGHT
ISMB'25 | Bioinformatics | GitHub | Wiki
Scientific motivation:
Multiple sequence alignment (MSA) is a cornerstone of genomic analysis, but aligning millions of sequences traditionally requires enormous compute time and resources.
What TWILIGHT enables:
TWILIGHT makes it possible to perform massive, high-quality multiple sequence alignments efficiently, supporting applications in evolution, disease tracking, and large-scale biological discovery.
Overall contribution of this work:
- Introduced a heterogeneous CPU–GPU execution pipeline that scales alignment to previously impractical dataset sizes (up to millions of sequences).
- Utilized parallel processing, asynchronous data transfer, and dynamic load balancing to maximize throughput.
- Delivered over 50X speedup while maintaining high alignment accuracy, making large-scale MSA much more practical for real research workflows.
High-Performance Genome Sequence Alignment using TALCO
Scientific motivation:
Genome sequence alignment is fundamental in bioinformatics, yet most approaches struggle to simultaneously achieve high accuracy, high speed, and energy efficiency.
What TALCO enables:
TALCO introduces a tiling-based alignment strategy leveraging the convergence of traceback pointers to deliver highly accurate alignments with exceptional computational efficiency.
Overall contribution of this work:
- Designed a hardware accelerator for the TALCO alignment strategy and deployed it on AWS F1 cloud infrastructure.
- Demonstrated approximately 2000X improvement in throughput-per-Watt compared to leading CPU/GPU methods.
- Recognized as an HPCA Best Paper Nominee, underscoring its impact and significance.